Abstract


Very-low-bandwidth video-conferencing, which is the simultaneous transmission of speech and pictures
(face-to-face communication) of the communicating parties, is a challenging application requiring an
integrated effort of computer vision and computer graphics. This paper consists of two major parts. First, we
present the outline of a simple approach to video-conferencing relying on an example-based hierarchical
image compression scheme. In particular, we discuss the use of example images as a model, the number
of required examples, faces as a class of semi-rigid objects, a hierarchical model based on decomposition
into different time-scales, and the decomposition of face images into patches of interest. In the second
part, we present several algorithms for image processing and animation as well as their experimental
evaluation. Among the original contributions of this paper is an automatic algorithm for pose estimation and
normalization. Experiments suggest interesting estimates of necessary spatial resolution and frequency
bands. We also review and compare different algorithms for finding the nearest neighbors in a database
for a new input as well as a generalized algorithm for blending patches of interest in order to synthesize
new images. Extensions for image sequences are proposed together with possible extensions based on
the interpolation techniques of Beymer, Shashua and Poggio (1993) between example images. Finally, we
outline the possible integration of several algorithms to illustrate a simple model-based video-conference
system.
Figure 1: Simple scheme to illustrate a video-conference
system.  Only  one  direction  of  data  flow  is  indicated.  For 
transmission  in  the  reverse  direction  a  mirror  symmetric 
system  is  used.  The  dashed  arrows  between  the  video 
and  the  audio  channel  indicate  a  possible  coupling. 


1  Introduction  and  motivation 

Video-conferencing requires the simultaneous transmission
of speech and pictures of the communicating parties
in real time and with high quality (see Figure 1). We
choose this apparently technical problem as a paradigm
to investigate new concepts related to the representation
of 3-D objects without explicit volumetric or structural
information. In particular, we present a hierarchical
architecture for model-based image compression of semi-rigid
objects. Human faces are typical representatives of
this object class.

This  paper  consists  of  two  major  parts.  The  first  part 
outlines  our  approach  to  video-conferencing  (Section  3) 
and  it  is  introduced  with  a  brief  overview  of  work  on 
face  images  and  especially  face  recognition.  Section  4 
presents  a  specific  architecture  for  a  video-conference 
system based on the previous approach. Several algorithms
for image processing and animation are described
and experimental results are discussed. A novel robust
algorithm for automatic pose estimation and normalization
in the presence of strong facial expressions is presented
and experimental results are discussed in detail.
At the end of Section 4, possible extensions to the
video-conferencing scheme by interpolation between example
images are reviewed and discussed.

2  Related  topics 

Conventional model-based image compression methods
are briefly reviewed to emphasize their potential problems
and to point out their differences from the approach
presented here. We review work on images of faces and
then focus our discussion on previous work on face recognition
that is of interest for our example-based approach
to video-conferencing. Finally, we list specific differences
between processing for face recognition and for
video-conferencing.


2.1 Conventional and model-based compression

The  objective  here  is  not  to  give  a  comprehensive  survey 
of  image  compression  techniques,  but  rather  to  sketch  a 
few  key  ideas. 

Most existing image coding techniques are designed
to exploit statistical correlations inherent to images or
to time sequences. Conventional waveform coding techniques
are known to introduce unnatural coding errors
when reproducing images at very low bit rates. The
source of this problem lies in the use of statistical correlations
which are not related to image content and cannot
make use of context information. However, such general
techniques are very successful for lower compression ratios.
Popular examples are JPEG (Joint Photographic
Experts Group) for single images and MPEG for motion
sequences, which have also been implemented in hardware.
JPEG applies the discrete cosine transform (DCT)
to blocks of 8 x 8 pixels and yields good results for compression
ratios up to about 1 : 25. The algorithm is
symmetrical, i.e., computational costs for compression
and for decompression are about the same.
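
As an illustration of the block transform mentioned above, the following minimal sketch applies an orthonormal 2-D DCT to a single 8 x 8 block and reconstructs it from a subset of low-frequency coefficients. It is not the JPEG standard itself, which additionally quantizes and entropy-codes the coefficients; the function names and the simple rule of keeping coefficients with u + v < keep are assumptions made for this example only.

```python
import math


def dct_1d(v):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(s * math.sqrt((1.0 if k == 0 else 2.0) / n))
    return out


def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        for k in range(1, n):
            s += c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out


def apply_2d(block, transform):
    """Apply a 1-D transform separably: first to all rows, then to all columns."""
    rows = [transform(list(r)) for r in block]
    cols = [transform(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def compress_block(block, keep):
    """Zero every coefficient with u + v >= keep (hypothetical rule), then invert."""
    coeffs = apply_2d(block, dct_1d)
    kept = [[c if u + v < keep else 0.0 for v, c in enumerate(row)]
            for u, row in enumerate(coeffs)]
    return apply_2d(kept, idct_1d)
```

Keeping all coefficients reproduces the block up to rounding error, while keeping only the first one reduces the block to its mean grey level; the achievable compression ratio depends on how many coefficients survive.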

It is widely accepted that only model-based compression
schemes have the potential of very high compression
rates while retaining high image fidelity. In model-based
coding schemes the sender (encoder) and the receiver
(decoder) contain a common specific model or special
knowledge about the objects that are to be coded. The
encoder analyzes the input images and estimates model
parameters. The decoder synthesizes and reconstructs
images from these parameters using the internal model
or knowledge. With this kind of coding, very low bit
rates can be realized since basically only the analyzed
model parameters are transmitted.

The general approach of such model-based coding
techniques for faces, which is still an active research topic
(cf. [1, 20, 44, 24, 25], to give some examples), is
to use a volumetric 3-D facial model (such as a polygonal
wire frame model). Full face images under standard
view can be projected onto the wire frame model for
reconstruction by using texture mapping techniques. Two
major difficulties are inherent to this approach. Firstly,
the generation of a realistic 3-D model for individual faces
is very difficult in itself. Almost all available automatic
techniques yield either poor results for faces or are not
applicable to video-conferencing because they require
controlled or artificial conditions (structured light
illumination, laser scanner, marks attached to the face
surface, etc.). Secondly, difficulties arise from the necessity
to register the 3-D model and the images used for texture
mapping. Also, 3-D motion parameters have to be
computed precisely from image data.

2.2  Work  on  face  images  and  recognition 

A good survey of the state of the art on image processing
of faces is given in [12]. However, most of the work
with faces in computer vision was done on processing for
recognition [33, 6, 61, 36, 39, 3, 2, 4, 15, 28, 9] and some
work on different kinds of classification tasks [21, 22, 14].
Almost all of this work treats face recognition as a static
problem approached by pattern recognition techniques
applied to single static images. Only recently has the attention
of researchers shifted to the temporal aspect of
facial expressions by using optical flow in sequences of
face images [41, 42, 25, 65].

We do not intend to give a comprehensive overview of
face recognition here. Rather, we will summarize some
ideas that are relevant to our work. Recently, a systematic
comparison of typical approaches (feature-based versus
template-based techniques) to face recognition was
carried out by Brunelli & Poggio [13, 15]. Several other
approaches to face recognition have also been presented
(for example [6, 61, 36, 4]).

The first approach is influenced by the work of Kanade
[33] and uses a vector of geometrical features for recognition.
First the eyes are located by computing the normalized
cross-correlation coefficient with a single eye template
at different resolution scales. The image is then
normalized in scale and translation. By independently
localizing both eyes, small amounts of head rotation can
be compensated by aligning the eye-to-eye axis to be horizontal.
The vertical and horizontal integral projections of
two directional edge maps (components of the intensity
gradient) are used to extract features, such as position
and size of nose and mouth, as well as eyebrow position
and thickness. Assumptions about natural constraints
of faces, such as bilateral symmetry and average facial
layout, are used at this stage. A total of 35 geometrical
features are extracted. Recognition is then performed
with a Bayes classifier applied to the vector of geometric
features.

The second approach uses template matching by correlation
and can be regarded as an extension of the pioneering
work of Baron [6]. Images of frontal views of
faces are normalized as described above. Each person is
represented by four rectangular masks centered around
the eyes, nose, and mouth, respectively. The relative position
of these masks is the same for all persons. The normalized
cross-correlation coefficients of the four masks
are computed by matching the novel image against the
database. Recognition is done by finding the highest
cumulative score. Some preprocessing is applied to the
grey-level images to decrease the sensitivity of correlation
to illumination effects. An interesting finding is that
recognition performance is stable over a resolution range
of 1 : 4 (within a Gaussian pyramid). This indicates that
quite small templates can be used, thus making correlation
feasible at very low computational cost.
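
The template-matching scheme just described can be sketched as follows. This is only an illustration under simplifying assumptions: each mask is given as a flat list of grey levels that has already been cut out of a normalized image, the mask names and the functions `ncc` and `recognize` are hypothetical, and the preprocessing against illumination effects is omitted.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation coefficient of two equally sized patches."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a constant patch carries no structure to correlate
    return sum(x * y for x, y in zip(da, db)) / denom


def recognize(probe, gallery):
    """Return the person with the highest cumulative score over all masks.

    probe:   dict mapping mask name -> list of grey levels
    gallery: dict mapping person -> dict of the same mask names
    """
    scores = {
        person: sum(ncc(probe[name], masks[name]) for name in probe)
        for person, masks in gallery.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Because the coefficient is invariant to gain and offset of the grey levels, a uniformly brighter or higher-contrast copy of a stored mask still reaches the maximum score of 1 per mask.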

Gilbert  &  Yang  presented  a  real  time  face  recognition 
system  using  custom  VLSI  hardware  [28].  Their  system 
is  based  on  the  template-matching  recognition  scheme 
outlined  by  Brunelli  &  Poggio  [13,  15]. 

In most of the work with faces the images are normalized
so that the faces have the same position, size and
orientation after manually locating the eyes and mouth.
Normalization is often achieved by alignment of the T
spanned by both eyes and the mouth.
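
As a simplified illustration of such a normalization, the sketch below computes the similarity transform (scale, rotation, translation) that maps two located eye positions onto fixed canonical positions, so that the eye-to-eye axis becomes horizontal. It uses the eyes only and ignores the mouth; the canonical coordinates and the function names are assumptions chosen for the example. Image points are treated as complex numbers, so that the transform takes the form z -> a*z + b.

```python
def similarity_from_eyes(left_eye, right_eye,
                         target_left=(64.0, 96.0), target_right=(192.0, 96.0)):
    """Return complex (a, b) such that z -> a*z + b maps the eyes to the targets.

    abs(a) is the scale factor and the phase of a is the rotation angle.
    The target positions are hypothetical canonical eye locations.
    """
    el, er = complex(*left_eye), complex(*right_eye)
    tl, tr = complex(*target_left), complex(*target_right)
    if el == er:
        raise ValueError("eye positions must be distinct")
    a = (tr - tl) / (er - el)
    b = tl - a * el
    return a, b


def apply_similarity(a, b, point):
    """Map an image point (x, y) through the transform z -> a*z + b."""
    z = a * complex(*point) + b
    return (z.real, z.imag)
```

To normalize an entire image, every output pixel would be looked up at its inverse-mapped position in the input image; that resampling step is omitted here.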

2.3 Face recognition vs. video-conferencing

Since much work has been done on face recognition, we
want to make use as much as possible of the available
experience. On the other hand, there are several significant
differences between the problems of face recognition
and of video-conferencing. These issues have important
implications for our approach and will be discussed in
the sequel; Table 1 summarizes the most important
differences.

In face recognition the task is in general to match
a new face image against a gallery of stored images of
different persons. The low-level processing should extract
significant features and a subsequent classification
should identify the appropriate person and should determine
if there is a good match in the database at
all. Recognition should be invariant against variations
of an individual face (emotional condition, not shaved
recently, etc.), but should be highly discriminative to
differences between individuals. To achieve this goal,
several images taken under different views and illumination
conditions are commonly stored in the database. In
many applications it is feasible to acquire the example
images under (or at least close to) standardized conditions,
such as frontal view and neutral facial expression.
Moreover, during the recognition phase, it is often possible
to repeatedly take snapshot images until one comes
close enough to the standard conditions (as mentioned
in [28] for instance).

A  different  paradigm  applies  for  applications  where 
recognition  is  used  for  validation  or  verification  only.  For 
instance,  in  an  access  control  system,  there  may  be  prior 
information  about  the  person’s  identity  available,  e.g., 
by  means  of  a  personalized  key  or  code-card.  Such  a 
system  has  only  to  decide  whether  the  match  with  a 
selected  entry  in  the  database  is  good  enough  in  order 
to verify the identity of the actual person. An exhaustive
search for the absolute optimum of the cost function,
as is commonly used in recognition, is not feasible
here. Other means for normalizing and thresholding the
cost function have to be utilized. Of course, the rate of
false positive decisions should be very low.

In video-conferencing the challenge is to achieve
high fidelity (high resolution, realistic colors and quasi
real time) transmission between communicating parties,
while keeping the required transmission bandwidth as
small as possible. A reasonable assumption is that the
identity of the person is known and does not change
throughout a session (i.e., video-conference call). This
assumption allows us to exploit knowledge and examples
accumulated during previous sessions of the same person.
As opposed to face recognition systems, a video-conference
system must be very sensitive in explicitly
detecting even minor individual variations. The detected
variations (either with respect to the previous images, or
with respect to similar example images) have to be parameterized
and coded to facilitate efficient transmission
but without sacrifice of detailed reconstruction.

For a video-conference system, we cannot significantly
restrict the range of admitted head poses or limit the variety
of facial expressions. The only applicable assumptions
arise from the physical and anatomical limitations
of the human face, head and body.

Another difference is rooted in the more passive character
of our task. We cannot repeatedly acquire new images
until we have a suitable one, but we must process
the incoming sequence of images. Moreover, we cannot
direct a person to behave in a certain way (e.g., to obtain
approximately standard conditions), as is possible for
recognition, at least while building the database. On the
other hand, the high sampling rate within the incoming
image sequence allows one to exploit smoothness in time
(due to the inertia of physical objects). Thus, the differences
between successive frames are small and finding
correspondences is easier than for recognition. The algorithms
need not start from scratch for each new image,
but can rely on predictions derived from previous images
to increase stability and coding efficiency. We have good
reasons to expect that real time operation (video frame
rate) can be achieved in the very near future with an
affordable platform.

Table 1: The most important differences between face recognition and video-conferencing (transmission for
reconstruction of images of 3-D objects) are summarized here.

face recognition                                                 | video-conferencing
-----------------------------------------------------------------+-------------------------------------------------------------------
task is comparison and matching against (large) example database | task is "best" possible reconstruction with small channel capacity
examples in database are from different faces                    | all examples are of the same face
discrimination of intra-individual features is important         | additional information for identification is available
should be invariant to individual variations                     | individual variations are most important
roughly standard pose of all faces                               | no a priori standard pose possible
roughly standard facial expressions possible                     | all facial expressions must be admitted
deals with static images                                         | quasi-continuous sequence of images
image acquisition can be repeated                                | no repetition, real time required

3  Outline  of  the  approach 

In this section we outline our approach to
video-conferencing. Due to limited space the scope will be
restricted to five topics that have the strongest impact upon
the system architecture presented in Section 4.

3.1  Examples  as  model 

Several conventional approaches for model-based coding
of face images are reported in the literature. Most of
them rely on an explicit metric 3-D model (wire frame
model) of the specific face [1, 44]. These volumetric models
are obtained from image data under different viewpoints
or from laser range-scanners. Shape-from-motion
algorithms are known to be not very stable and quite
noise sensitive. Recently, several structure-from-motion
algorithms have been demonstrated to yield good results
from real image sequences. However, they crucially depend
on stable features that can be accurately localized
and tracked in the images. In face images, features of
this kind are not present in sufficient number or quality.
In some work auxiliary marks (like white points in [1])
were attached to the person’s skin. While such aids may
be useful for research purposes they are certainly not acceptable
for a commercial video-conference system. On
the other hand, laser range-scanners are very expensive,
comparatively slow, and difficult to handle (due to subtle
scanning mechanics). The problems are even more
severe for systems that have aligned video cameras to
simultaneously record images for texture mapping.

In contrast to these more conventional concepts for
video-conferencing we want to avoid the detour of explicit
3-D models. In this paper we advocate a model-based
coding scheme to reconstruct images of 3-D objects
(e.g., faces) directly from images. The model is
based on a set of 2D example images of a person’s face.
We assume that the set of example images is acquired at
the beginning of a session; possibly the system may fall
back upon an example database from previous sessions.
The set of example images is initially transmitted to the
receiver (by means of conventional image compression
techniques). During the subsequent continuous transmission
the examples are already stored on the sender
(encoder) and on the receiver (decoder) side. Therefore,
approaches of this kind are also called “memory-based”.

The stored example images span a high dimensional
space of different poses, viewpoints, facial expressions,
and also illumination conditions. As suggested for instance
by Poggio & Brunelli [46], object images can be
generated by interpolating between a small set of example
images. They described this interpolation in terms
of learning from examples [45, 47, 48].

To accomplish the interpolation for image reconstruction,
two different approaches are conceivable. The first
is related to a new approach to computer graphics [46].
This method has been proposed to synthesize new images
from examples attributed with specific and given
parameters. For instance, the image sequence of a walking
person can be interpolated over time from a few images
showing distinct postures; here the parameter is
simply time (cf. [46]). In computer graphics we can interactively
choose the right examples and tailor the parameterization
to synthesize images that are close enough
to what we want by interpolation in a relatively low-dimensional
parameter space. Some applications for special
animation effects in movies can also be found in [64].

The second approach has the same memory-based flavor
and is in fact provably almost equivalent. The first
step is to find the nearest neighbors, i.e., the most similar
examples according to some appropriate distance measure,
to the novel view within the database. The following
step may then estimate the weight for each neighbor
to yield the best interpolation (that is, the closest
weighted combination to the novel image) between
these examples. There may also be higher dimensional
cases where better interpolation can be achieved if examples
other than the nearest neighbors are used [10]. For
the purpose of video-conferencing this second approach
seems more natural. It is in fact not immediately obvious
how details of facial expressions should be parameterized
in a video-conference system. Moreover, the first
method requires explicit estimation of these parameters
in addition to pose. The second approach avoids the
bottleneck of predetermined parameterization, but selects
the basis for interpolation in a more adaptive way.
Thus, it is potentially more flexible at the expense of a
data-dependent higher dimensional parameter space.

To evaluate the feasibility of this concept in the preliminary
implementation of this paper we will consider
only the nearest neighbor in the database. Beymer,
Shashua & Poggio [10] have provided a preliminary evaluation
of the two approaches for video-conferencing.
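
The nearest-neighbor step and the subsequent weight estimation can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this paper: images are flattened to lists of grey levels, the distance measure is assumed to be the squared Euclidean distance, only the two nearest examples are blended, and all function names are hypothetical. The encoder transmits two indices and one weight; the decoder rebuilds the image from its own copy of the examples.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbors(novel, examples, k=1):
    """Indices of the k examples closest to the novel image."""
    order = sorted(range(len(examples)),
                   key=lambda i: squared_distance(novel, examples[i]))
    return order[:k]


def blend_weight(novel, e1, e2):
    """Least-squares w minimizing |novel - (w*e1 + (1-w)*e2)|^2."""
    d = [p - q for p, q in zip(e1, e2)]
    denom = sum(v * v for v in d)
    if denom == 0.0:
        return 1.0  # identical examples: any weight gives the same blend
    return sum((x - q) * v for x, q, v in zip(novel, e2, d)) / denom


def encode(novel, examples):
    """Encoder side: two example indices and one interpolation weight."""
    i, j = nearest_neighbors(novel, examples, k=2)
    return i, j, blend_weight(novel, examples[i], examples[j])


def decode(code, examples):
    """Decoder side: rebuild the image from the shared example database."""
    i, j, w = code
    return [w * p + (1.0 - w) * q for p, q in zip(examples[i], examples[j])]
```

Setting k=1 and transmitting only the index corresponds to the pure nearest-neighbor variant considered in the preliminary implementation.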

3.2  Number  of  examples 

The major objection to the example-based approach for
a video-conference system might be an excessive requirement
of memory to store the example database common
to the sender (encoder) and the receiver (decoder) side.

Due to today’s semiconductor technology, however,
fast memory is affordable in abundance. For instance,
standard RAM chips with 16 MB can already accommodate
256 images of full size (256 x 256 pixels) without any
further compression. Therefore, storage capacity does
not cause an insuperable problem.
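
The storage figure can be checked with a short calculation. The sketch assumes grey-level images with one byte per pixel, which is what makes the quoted numbers agree; the function name is hypothetical.

```python
def database_bytes(n_images, width, height, bytes_per_pixel=1):
    """Uncompressed size of an example database in bytes."""
    return n_images * width * height * bytes_per_pixel


# 256 images of 256 x 256 pixels at one byte per pixel (assumed)
total = database_bytes(256, 256, 256)
megabytes = total / 2 ** 20
```

Under this assumption each image occupies 64 KB and the whole database exactly 16 MB; color images with three bytes per pixel would need three times as much.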

On  the  other  hand,  the  costs  for  initial  transmission 
of  the  examples  should  be  kept  as  low  as  possible.  Thus, 
an  interesting  question  is  to  what  extent  the  number  of 
examples  can  be  reduced  without  restricting  the  variety 
of  expressions  that  can  be  synthesized. 

Loosely  speaking,  one  can  think  of  a  high  dimensional 
space  that  is  defined  or  spanned  by  the  example  images. 
The  dimension  of  this  space  is  related  to  the  number 
of  distinct  face  images  that  can  be  generated.  In  other 
words,  we  want  to  reduce  the  number  of  examples  used  as 
nearest  neighbors  or  for  interpolation  without  reducing 
the  dimensionality  of  this  example  space. 

There are several possibilities. The common thread is
to divide the abstract example space into lower dimensional
subspaces. This, however, relies on the assumption
that various properties of face images are separable,
i.e., certain aspects of the images are to a large extent
independent of others. For reconstruction, a face image
can then be composed of intermediate results obtained
within the subspaces.

Clearly, the concrete separation into subspaces must
be subjected to experimental justification. The next sections
discuss some ways to reduce the number of examples,
i.e., reduce the costs for initial transmission and
storage, while at the same time preserving the dimensionality
of the example space.

3.3  Faces  as  semi-rigid  objects 

Human heads/faces are representatives of a special class
of objects; we call them semi-rigid objects. The name
accounts for the fact that these objects are not thoroughly
rigid, but at a larger scale still approximately retain
their shape. A prerequisite is that a decomposition
of the object dynamics into the motion of the object as
a whole (e.g., translation of center of mass and rotation
about an axis through this point) and the variation of
the object’s shape makes sense. Moreover, the dynamic
range of variations in the object shape is small compared
to the total object extension. In other words, the variation
in shape can be formulated as a perturbation to
a standard or average shape. Since the object shape is
subjected to variations, we cannot expect to find points
on the object surface that are fixed with respect to any
object-centered reference frame, e.g., as is defined by the
skull. Therefore, it will, in general, not be possible to
infer the object’s position based on observations of localized
feature points on the surface.

An experimental finding for face images is that facial
expressions as well as detailed features and the overall
shape of the head become apparent at different ranges
of spatial frequency. This is demonstrated in Figure 4.
Based on this observation, we conjecture that the pose
of the semi-rigid head can be estimated from the
low-frequency images alone, thus discarding the variations
due to facial expressions.
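
One common way to obtain such low-frequency images is a Gaussian pyramid, in which each level is a blurred and subsampled copy of the previous one. The sketch below is a generic illustration rather than the filter bank used in our experiments; the 5-tap binomial kernel, the border replication, and the function names are assumptions.

```python
# 5-tap binomial kernel (an assumed choice of low-pass filter)
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)


def blur_1d(v):
    """Low-pass filter a sequence, replicating the border samples."""
    n = len(v)
    out = []
    for i in range(n):
        s = 0.0
        for k, w in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)
            s += w * v[j]
        out.append(s)
    return out


def reduce_level(image):
    """Blur rows and columns, then keep every second pixel in each direction."""
    rows = [blur_1d(list(r)) for r in image]
    cols = [blur_1d(list(c)) for c in zip(*rows)]
    blurred = [list(r) for r in zip(*cols)]
    return [row[::2] for row in blurred[::2]]


def pyramid(image, levels):
    """Return [image, level 1, ..., level `levels`], from fine to coarse."""
    out = [image]
    for _ in range(levels):
        out.append(reduce_level(out[-1]))
    return out
```

Fine detail, such as a pattern alternating from pixel to pixel, is flattened to its mean value by the blur, while smooth large-scale structure survives into the coarse levels that would be used for pose estimation.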

In face recognition, labeled feature points
like eyes, corners of the mouth, etc. (see Figure 3 for
illustration) are commonly used to compensate for the pose.
This requires detection and accurate localization of corresponding
points in a new image and the example images in the
database. Reliable detection and precise localization of
such predefined and labeled features in one image are
already difficult. Furthermore, inferring the pose from
correspondence of such feature points across distinct images
is prone to errors in the presence of facial expressions.
For instance, the pupils of the eyes may move by
more than 1 cm to both sides due to changes in direction
of gaze and vergence movements as is depicted in
Figure 3. Also, the pupils may temporarily disappear
when the eyelids are closed during winking or blinking.
The corners of the mouth are rather unreliable feature
points for pose estimation, since they are subjected to
substantial movements with respect to the head due to
facial expressions and during normal speech.

To circumvent the problems sketched above, we propose
an adaptive strategy for robust pose estimation
and compensation that is suitable for video-conferencing.
Details are described in Section 4.1. In a nutshell the
main ideas are: to make use only of the low-frequency
bands within a coarse-to-fine refinement technique to
estimate the pose; to use the constant brightness constraint
to find correspondence for all image points; to
rely on lower level criteria to select adequate correspondence
points based on a confidence measure; and, finally,
to use robust statistics to estimate model parameters
within the multiresolution refinement.

Figure 2: Description at different time-scales gives rise to
a hierarchical system architecture (see text for details).

3.4  Description  on  different  time  scales 

In what follows we consider the object model on different
time-scales. Based on this decomposition, we derive a hierarchical
architecture for a video-conference system. At
least three different time-scales should be distinguished
as is depicted in Figure 2.

Notice that the general idea is by no means restricted
to video-conferencing, but carries over to recognition.
This claim is reflected in the three lines under the arrow
in Figure 2 leading from a general formulation (top)
to the specific example of faces and video-conferencing
(bottom).

It is a reasonable assumption that the object class a
system has to deal with remains the same over a "long"
period of time or does not change at all. Thus, on the
slowest time scale we can assume that we have prior
knowledge about the object class. This suggests the use
of a generic object model such as an average or prototypical
face. Some appropriate model assumptions at this
level are: a rough model of the 3-D shape of a face/head
(e.g., a quadric surface, see [53, 54] for details); human
heads exhibit approximately bilateral symmetry; the set
of constituents of a face (two eyes, nose, mouth, two
ears, etc.); stable features of constituents (the pupil is
round and darker than the white of the eyeball, teeth are
white, the holes in the nose are dark, the relative size between
different parts, etc.); the geometric arrangement
of constituents (nose is between mouth and eyes, and
vertically centered between both eyes, the ears are far
apart, etc.); the variability of constituents in color and
in geometry (eyes are fairly fixed in the head, location of
pupil changes with direction of gaze, height of an open
mouth does not exceed its width, etc.).

On the intermediate time-scale we are concerned with
a specific representative of the class. The object model
is refined by accounting for individual features that are
specific to a person’s face. Typical features that are fairly
stable are, for example: the 3-D shape of the head insofar
as determined by the skull (excluding the region
around the mouth, of course); the color of eyes, hair,
and teeth; to some extent also the tint and the texture
of the skin; typical dynamics of facial expressions and
miming; defects (e.g., scars left from operations or accidents)
or irregularities (e.g., moles or stained spots) in
the appearance.

Finally, on the fastest time-scale we have to deal with
the variations of an individual object instance. Even
for many non-rigid objects it makes sense to decompose
the description of the object dynamics, for instance, into
the motion (translation and rotation) of a local reference
frame (e.g., center of mass as origin)* and a description
of the non-rigid dynamics with respect to this object
centered reference frame. For the video-conference system
we therefore want to separate the estimation of the
global pose from the dynamics of the more local facial
expressions.

3.5  Decomposition  into  patches  of  interest 

A further way to reduce the number of examples and the
storage requirements is to subdivide the normalized face
images into disjoint or overlapping subregions. Such
subregions clipped from the original image may have arbitrary
shapes to allow for maximal generality. They are
called patches of interest (POI) in the sequel, as opposed
to the rectangular regions of interest (ROI) commonly
used. This concept is suitable to animate fine
facial movements and facial expression by blending the
subregions together to reconstruct a composite image at
the receiver side. The most important POIs are located
around the eyes and the mouth. Note that because of
the arbitrary shape, different POIs for the eye pupil, eyelid,
and eyebrows could be used as well as different POIs
for the corners of the mouth and teeth. In Section 4.3
two algorithms to find the nearest neighbor of an image
region are described. A versatile algorithm for blending
possibly overlapping POIs of arbitrary shape together is
described in Section 4.4.1.

Interestingly, the number of subimages (pasted into
a base image of a face) needed to achieve realistic animation
is often surprisingly small. This fact has also
been pointed out in the literature (see [24] for example).
A realistic animation of an eye blink can be achieved
by using only four distinct subimages if the eyeball is
fixed. Duffy noted that only five different images are
required for satisfactory simulation of natural eye movements.
However, this should be considered as a lower
bound to achieve realistic animation of the position of
the eyeball (without interpolation between the examples).
Also, some examples for different pupil diameters
will be required. Moreover, in experiments on lip-reading
it has been demonstrated that the essence of a conversation
can be picked up from a sequence of images alone.
Remarkable, however, is that a set of only 19 visually
distinguishable images (showing particular arrangements of
lip, tongue, and teeth) is sufficient [30]. Although for
realistic video-conferencing a larger number of images may
be required, this result is very encouraging.

The advantage of having distinct example patches for
different regions of a face is manifold. First of all, instead
of requiring separate example images of the whole
face for all possible poses, lip-shapes, eye positions, etc.,
a face can be reconstructed from a smaller number of
example patches for subregions. For instance, we want
to synthesize various face images with eye blinks and
speech. Assuming the above mentioned number of example
images for the subregions (let's say 20 different
examples for the mouth and 5 distinct images for each
eye) a total of 500 different combinations can be obtained.
Thus using only 31 (including a base image for
the whole face) examples a much larger number of face
images can be animated. Notice that the gain in the
number of possible combinations will increase dramatically
the more we can subdivide the high dimensional
space of possible faces into lower dimensional subspaces.

* More generally, we may admit not only rigid transformations;
the local reference frame may be non-orthogonal and
time-varying.

In  addition  to  significantly  reducing  the  number  of 
needed  example  images,  the  memory  required  to  store 
the  POIs  is  much  less  than  for  the  whole  face  images, 
since  the  memory  required  for  a  POI  is  roughly  pro¬ 
portional  to  the  percentage  of  the  area  it  covers  in  the 
image. For the images shown in this paper (for instance
the left image in Figure 5) this yields approximately 1%
for  each  eye  POI  and  about  2-3%  for  the  region  around 
the  mouth. 
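
The bookkeeping behind these figures can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the 2.5% mouth fraction is an assumed midpoint of the 2-3% range quoted above.

```python
# Example counts quoted in the text.
mouth_examples = 20
eye_examples = 5            # per eye
base_images = 1

# Every mouth patch can be combined with every left-eye and right-eye patch.
combinations = mouth_examples * eye_examples * eye_examples
stored_examples = mouth_examples + 2 * eye_examples + base_images

# Storage in units of one full face image, using the area fractions from
# the text (assumption: 2.5% as the midpoint of the quoted 2-3%).
eye_fraction = 0.01
mouth_fraction = 0.025
storage = (base_images
           + 2 * eye_examples * eye_fraction
           + mouth_examples * mouth_fraction)

print(combinations, stored_examples, round(storage, 2))
```

Under these assumptions the 31 stored examples occupy roughly the memory of 1.6 full face images, while covering 500 distinct combinations.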

Another  benefit  of  utilizing  POIs  stems  from  geomet¬ 
ric  constraints  of  the  imaging  process.  The  subregions  of 
POIs  usually  will  be  projections  (perspective  projection 
in  general)  of  a  small  patch  of  the  3-D  face  surface.  For 
most  POIs  the  corresponding  3-D  patch  will  have  little 
depth  structure  and  therefore  can  be  well  approximated 
by  a  plane.  Each  POI  can  be  subjected  to  separate  trans¬ 
formations  that  account  for  the  global  head  pose  before 
the subregions are blended together. Such transformations
can be affine transformations in the image plane or
even projective transformations of a plane in 3-D. Even
if  the  3-D  surface  is  not  exactly  a  plane,  perspective  dis¬ 
tortions  and  occlusion  will  be  much  less  of  a  problem 
for  smaller  patches.  The  advantage  is  that  it  may  not 
be  necessary  to  have  example  subimages  for  all  the  va¬ 
riety  of  head  poses.  Rather,  only  a  few  samples  may 
be  sufficient.  This  is  because  intermediate  views  can  be 
generated  by  appropriate  transformations  with  good  fi¬ 
delity.  Thus,  fewer  examples  are  required  to  synthesize 
faces  with  the  same  variety  of  expressions. 

Probably  the  most  compelling  argument  for  decom¬ 
posing  face  images  into  subregions  has  to  do  with  the 
nearest  neighbors  procedure.  The  most  conspicuous  fa¬ 
cial  features  to  a  human  observer  cover  only  a  small  frac¬ 
tion  of  the  overall  face  image,  e.g.,  we  are  very  sensitive 
to  even  minor  variations  around  the  eyes.  Any  proce¬ 
dure  to  assess  similarity  between  face  images  that  relies 
on the whole face will be less sensitive to those local variations.
Take, for instance, normalized cross-correlation
of  the  full  images.  The  correlation  value  is  likely  to  be 
dominated  by  small  illumination  differences  affecting  the 
whole  image  rather  than  by  local  variations  in  the  “sen¬ 
sitive”  regions  around  the  eyes. 
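
The loss of sensitivity can be illustrated with a small synthetic experiment. The sketch below is not taken from our implementation: a random array stands in for a face image, a 10 x 10 block (1% of the area) stands in for an eye region, and the helper `ncc` is a hypothetical name.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized grey-level arrays."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

rng = np.random.default_rng(0)
face = rng.uniform(0.0, 1.0, size=(100, 100))   # stand-in for a face image

# Same "face", but the eye region is replaced by unrelated content.
changed = face.copy()
eye = (slice(30, 40), slice(30, 40))
changed[eye] = rng.uniform(0.0, 1.0, size=(10, 10))

full_score = ncc(face, changed)              # stays close to 1
patch_score = ncc(face[eye], changed[eye])   # close to 0

print(full_score, patch_score)
```

The whole-image score barely registers a change that completely alters the patch, whereas the score computed on the patch alone drops sharply.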

Notice  that  many  of  these  arguments  carry  over  to 
conventional  model-based  approaches.  Using  a  physical 
3-D  model  may  remedy  problems  of  perspective  and  oc¬ 
clusion.  But  still,  the  appropriate  texture  (e.g.,  mouth, 
eyes  open  or  closed)  to  be  mapped  is  required.  This 
texture  is  supplied  as  a  2-D  grey-level  or  color  image  in 
a  standard  view.  The  task  of  generating  this  image  is 
essentially  the  same  as  in  the  example-based  approach 
advocated  in  this  paper. 


4  System  architecture 

In  this  section  several  algorithms  that  have  been  de¬ 
veloped  so  far  within  the  novel  framework  for  video- 
conferencing  will  be  presented  and  discussed  in  more 
detail.  Before  doing  so,  we  will  sketch  a  possible  archi¬ 
tecture  for  a  very  simple  system.  The  intention  is  not  to 
present  a  working  system,  but  to  outline  its  major  com¬ 
ponents  so  that  the  developed  algorithms  can  be  seen  in 
an  appropriate  context. 

A  simple  system  architecture  should  comprise  the  fol¬ 
lowing  components: 

1.  compute  normalized  pose  of  new  facial  images  rel¬ 
ative  to  a  given  example  and  estimate  pose  param¬ 
eters 

2.  find  nearest  neighbors  within  an  example  database 
of  subimages,  e.g.,  regions  around  eyes,  mouth, 
nose,  etc. 

3.  transmit  model  parameters,  such  as  pose  parame¬ 
ters  and  index  numbers  of  the  nearest  neighbors 

4.  reconstruct  the  face  image  on  the  receiver  side  by 
blending  the  regions  of  the  subimages  together 

5.  transform  the  composed  face  image  into  the  pose 
of  the  original  image  on  the  sender  side 

A  desirable  extension  to  this  simple  scheme  is  to  “inter¬ 
polate”  each  subimage  between  several  suitable  examples 
(see  [10]).  Also,  adequate  ways  to  update  the  example 
database  automatically  have  to  be  devised. 
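
To make step 3 concrete, the following sketch packs one frame's worth of model parameters into a byte string. The message layout, the field names, and the choice of three patches are assumptions made for illustration only; no wire format is specified in this paper.

```python
import struct
from dataclasses import dataclass, astuple

# Hypothetical layout: four float32 pose values, three one-byte indices.
WIRE_FORMAT = "<4f3B"

@dataclass
class FrameParameters:
    tx: float        # translation in x (pixels)
    ty: float        # translation in y (pixels)
    scale: float     # isotropic scale factor
    angle: float     # in-plane rotation (radians)
    left_eye: int    # index of nearest-neighbor example, left eye patch
    right_eye: int   # index of nearest-neighbor example, right eye patch
    mouth: int       # index of nearest-neighbor example, mouth patch

    def pack(self) -> bytes:
        return struct.pack(WIRE_FORMAT, *astuple(self))

    @classmethod
    def unpack(cls, payload: bytes) -> "FrameParameters":
        return cls(*struct.unpack(WIRE_FORMAT, payload))

frame = FrameParameters(tx=1.5, ty=-2.0, scale=1.25, angle=0.5,
                        left_eye=3, right_eye=3, mouth=17)
payload = frame.pack()
print(len(payload))                                  # bytes per frame
print(FrameParameters.unpack(payload) == frame)      # round trip
```

With this layout a frame is described by 19 bytes, independent of the image resolution; the receiver only needs the same example database to look up the indexed patches.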

4.1  Automatic  and  robust  pose  estimation 

In  what  follows  we  describe  a  novel  algorithm  for  auto¬ 
matic  pose  estimation  and  normalization  of  new  face  im¬ 
ages  relative  to  a  given  example,  i.e.,  a  reference  image. 
Some  of  the  techniques  used  in  this  approach  to  pose  es¬ 
timation  are  somewhat  established  in  motion  estimation 
and  are  reviewed  briefly  for  the  sake  of  completeness. 
We  will  emphasize,  however,  the  original  parts  of  our 
algorithm,  based  on  the  discussion  of  Section  3. 

The  new  algorithm  can  be  sketched  as  follows.  Using 
a  restricted  affine  model  for  the  transformation,  four  pa¬ 
rameters  (translation,  fronto-parallel  rotation,  and  scale) 
are  estimated  in  a  correspondence  scheme  on  the  coarse 
resolution  levels  of  a  Laplacian  pyramid  only.  Local  con¬ 
fidence  measures  for  correspondence  are  used  as  statisti¬ 
cal  weights  during  least  squares  error  (LSE)  fit  of  these 
parameters.  Off-plane  rotation  is  handled  by  separate 
example  images.  The  error  between  the  unconstrained 
measured  displacement  vector  field  and  affine  motion 
transformation at higher resolutions can be used to assess
the  similarity  between  both  images  (see  Section  4.3.2). 

4.1.1  Analysis  in  spatial  frequency  bands 

All  computation  is  performed  in  a  hierarchical  data 
structure  of  the  kind  originally  proposed  by  Tanimoto  & 
Pavlidis  to  speed  up  various  image  processing  operations 
[57].  A  comprehensive  overview  of  multiresolution  image 
processing  and  pyramid  structures  is  given  by  Rosenfeld 
[50]. 


For each image we compute a multiresolution pyramid,
where $I(x, y, t)$ denotes the discrete grey-level image.
The correspondence algorithm can be applied using
either Gaussian or Laplacian pyramids. We compute
these pyramids, adopting the algorithms proposed by
Burt [16] and Burt & Adelson [17]. A multiresolution
pyramid is a stack of $N$ levels of progressively smaller
versions of the original image. Let $l$ denote the level
within the pyramid and let $G^l$ be the reduced image at
the $l$-th level. The bottom of the pyramid is the original
image itself, i.e., $G^0 = I$. The image array $G^{l+1}$ at a
higher level is a lowpass filtered and subsampled copy of
its predecessor $G^l$. The images that form the Gaussian
pyramid are computed recursively by applying a reduce
operator to the previous level:

$$G^{l+1} = \mathrm{reduce}\; G^{l}$$

This procedure is iterated until the highest level $N$ is
reached. The reduce operator performs a convolution
with a smoothing kernel and a subsequent sampling at
every second image location in $G^l$, i.e., every second row
and column is skipped. Thus, the size in each direction
of the image array reduces roughly by a factor of two
between successive levels. As a consequence, the spatial
resolution and image size shrink to a quarter. In our
implementation we use a separable and symmetric 5 x 5
smoothing kernel. Its 1-D coefficients are derived from
the binomial distribution, i.e., $\frac{1}{16}(1, 4, 6, 4, 1)$.
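
A minimal NumPy sketch of the reduce operator and the resulting Gaussian pyramid is given below. The function names and the border treatment (edge replication) are assumptions of this sketch; the text above does not prescribe how image borders are handled.

```python
import numpy as np

# 1-D binomial coefficients (1/16)(1, 4, 6, 4, 1)
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def smooth(image):
    """Separable 5 x 5 binomial lowpass; borders use edge replication."""
    h, w = image.shape
    padded = np.pad(image, 2, mode="edge")
    # Filter along the rows first, then along the columns.
    rows = sum(k * padded[:, i:i + w] for i, k in enumerate(KERNEL))
    return sum(k * rows[i:i + h, :] for i, k in enumerate(KERNEL))

def reduce_level(image):
    """Lowpass filter, then keep every second row and column."""
    return smooth(image)[::2, ::2]

def gaussian_pyramid(image, levels):
    """Return [G^0, G^1, ..., G^levels] with G^0 the original image."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid

pyramid = gaussian_pyramid(np.ones((64, 64)), levels=3)
print([level.shape for level in pyramid])
```

For a 64 x 64 input and three reduction steps the level sizes are 64, 32, 16, and 8 pixels per side, i.e., each level holds a quarter of the pixels of its predecessor.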

A Gaussian pyramid consists of an ordered set of lowpass
filtered versions of the original image. For the previously
discussed type of Gaussian pyramids, the filter
bandlimit reduces by an octave (factor of two) from level
to level. A Laplacian pyramid may be regarded as a stack
of bandpass filtered image "incarnations". The name
arises from the fact that the Laplacian edge detector†
commonly used in image enhancement can be approximated
by the difference of Gaussians [40]. The optimal
ratio (the one that leads to the best approximation)
of the standard deviations of the inhibitory and excitatory
Gaussians is about 1.6. An efficient algorithm to compute
a pyramid of bandpass filtered images as differences of
lowpass filtered images is the DOLP transform (difference
of lowpass transform) proposed by Crowley & Stern [23].
However, here we construct a Laplacian pyramid from the
difference of images at adjacent levels of the Gaussian
pyramid as proposed by Burt & Adelson. Therefore, the
ratio of the standard deviations is 2.0. This results in a
broader filter bandwidth. This is favorable to achieve
consistent motion estimation across several frequency bands
(pyramid levels) since there is significantly more overlap
between adjacent bands. The center frequency of the bandpass
changes by an octave between levels.

The levels $L^l$ of the Laplacian pyramid can be generated
from the Gaussian pyramid by using an expand
operator:

$$L^{l} = G^{l} - \mathrm{expand}\; G^{l+1},$$

where we define $L^N = G^N$ for the highest level $N$. The
expand operator may be thought to do basically the reverse
of reduce. Its effect is to expand an image array
$G^{l+1}$ to an array having twice the linear extent by
interpolating values at intermediate locations at level $l$
between given sample points in $G^{l+1}$.

† Formally this is the Laplacian operator applied to
a Gaussian convolution kernel $g_\sigma$ of standard deviation $\sigma$:
$\nabla^2 g_\sigma(\mathbf{x}) = \frac{\|\mathbf{x}\|^2 - 2\sigma^2}{\sigma^4}\, g_\sigma(\mathbf{x})$.

The upper limit for the storage requirement of such
a pyramid is $\frac{4}{3}S$, where $S$ is the memory required for
the original image. Moreover, the computational costs
increase linearly with $S$.
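
The construction of the Laplacian pyramid and the storage bound can be sketched as follows. As before, the function names and the border handling are assumptions; the expand operator is realized here by zero insertion followed by the same binomial lowpass, which is one common way to interpolate the intermediate samples.

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def smooth(image):
    """Separable 5 x 5 binomial lowpass; borders use edge replication."""
    h, w = image.shape
    padded = np.pad(image, 2, mode="edge")
    rows = sum(k * padded[:, i:i + w] for i, k in enumerate(KERNEL))
    return sum(k * rows[i:i + h, :] for i, k in enumerate(KERNEL))

def reduce_level(image):
    return smooth(image)[::2, ::2]

def expand_level(image, shape):
    """Upsample to `shape`: insert zeros, then interpolate by lowpass filtering."""
    up = np.zeros(shape)
    up[::2, ::2] = image
    return 4.0 * smooth(up)      # factor 4 compensates for the inserted zeros

def laplacian_pyramid(image, levels):
    gauss = [np.asarray(image, dtype=float)]
    for _ in range(levels):
        gauss.append(reduce_level(gauss[-1]))
    # L^l = G^l - expand G^(l+1)
    lap = [g - expand_level(g_up, g.shape) for g, g_up in zip(gauss, gauss[1:])]
    lap.append(gauss[-1])        # L^N = G^N
    return lap

def reconstruct(lap):
    """Invert the decomposition by adding the bands from coarse to fine."""
    image = lap[-1]
    for band in reversed(lap[:-1]):
        image = band + expand_level(image, band.shape)
    return image

rng = np.random.default_rng(0)
original = rng.uniform(0.0, 1.0, size=(64, 64))
lap = laplacian_pyramid(original, levels=3)
total_pixels = sum(band.size for band in lap)
print(total_pixels, 4 * original.size / 3)
```

The decomposition is exactly invertible by construction, and the total number of stored samples stays below $\frac{4}{3}S$ because the level sizes form a geometric series with ratio 1/4.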

4.1.2  Estimation  of  local  transformation  and 
confidence  values 

Let us suppose that we have two similar face images of
the same person. We name the first image $E = E(\mathbf{x}, n)$,
indicating that it is one of the example images in the
database ($n$ is an index number), and a second new image
$I = I(\mathbf{x}, t)$ acquired by the video camera at time
$t$ (with $\mathbf{x} = (x, y)^T$ being the location in the image).
Our task is now to bring the new image $I$ to the closest
possible alignment with $E$. Moreover, we want to
obtain a robust estimate for the transformation parameters
despite the fact that both facial expressions may be
significantly different (see Figure 3 for examples).

Our  novel  algorithm  that  achieves  this  goal  can  be 
subdivided  into  two  main  steps.  Firstly,  we  describe  a 
differential  technique  for  local  alignment  of  small  image 
patches  in  two  images.  Secondly,  we  present  an  algo¬ 
rithm  that  fits  the  parameters  of  a  restricted  affine  model 
to  describe  the  global  transformation  between  the  two 
faces  due  to  different  poses  in  both  images.  Even  though 
the  discussion  here  uses  the  terminology  tailored  to  our 
video-conference  system,  the  results  and  algorithms  can 
be  generalized  to  other  problems. 

In our derivation we follow the lines of Lucas &
Kanade who first proposed a differential algorithm for
image registration and stereo [37, 38]. However, more
recently related techniques have been presented in various
flavors in the context of motion estimation or optical
flow computation [26, 31, 43, 32, 34, 2f), 55]. A comprehensive
survey and comparison of differential and other
optical flow techniques is given by Barron, Fleet & Beauchemin
[7]. Unfortunately, they do not consider coarse-to-fine
methods that are essential to extend the velocity
range of differential techniques.

We assume that at a sufficiently low resolution both
images are locally similar and that a small patch in $I$
can be approximated as being a shifted version of a
corresponding patch in $E$. That is: $I(\mathbf{x}) = E(\mathbf{x} + \mathbf{d}(\mathbf{x}))$,
where $\mathbf{d}(\mathbf{x})$ is the local displacement vector that we want
to estimate. In the case of optical flow techniques $E$
and $I$ are two consecutive images taken from a motion
sequence and $\mathbf{d} = \mathbf{v} \cdot \Delta t$ depends on the instantaneous
local velocity $\mathbf{v}$ and the time interval $\Delta t$.

In order to align both image patches we have to search
for the displacement $\mathbf{d}$ that minimizes a distance measure
between the patches in $I$ and $E$. A typical measure
is the $L_2$-norm of the grey-levels over a certain neighborhood
$\Omega$ centered around the image point $\mathbf{x}$. Moreover, we
assume that $\mathbf{d}(\mathbf{x})$ varies only smoothly and thus can be
modeled to be constant over $\Omega$. This is reasonable since
we apply this procedure to bandlimited images anyway.
In addition we want to allow for a weighting function
$W(\mathbf{x}) > 0$, which gives us the freedom to emphasize the
central region of $\Omega$ over the periphery. In order to find
$\mathbf{d}$ we can formulate a least squares problem. Thus, we
want to minimize:

$$e = \sum_{\mathbf{x} \in \Omega} W^2(\mathbf{x})\, \| I(\mathbf{x}) - E(\mathbf{x} + \mathbf{d}(\mathbf{x})) \|^2 \qquad (1)$$

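We approximate $E(\mathbf{x} + \mathbf{d}(\mathbf{x}))$ by a Taylor expansion
truncated after the linear term and differentiate the error
$e$ with respect to $\mathbf{d}$. The displacement that minimizes
(1) is the solution of the equation system

$$\mathbf{D}\,\mathbf{d} = \mathbf{c}, \quad \text{where} \quad \mathbf{c} = \begin{pmatrix} \sum W^2(\mathbf{x})\, I_x(\mathbf{x})\, \Delta I(\mathbf{x}) \\ \sum W^2(\mathbf{x})\, I_y(\mathbf{x})\, \Delta I(\mathbf{x}) \end{pmatrix} \qquad (2)$$

and

$$\mathbf{D} = \begin{pmatrix} \sum W^2(\mathbf{x})\, I_x^2(\mathbf{x}) & \sum W^2(\mathbf{x})\, I_x(\mathbf{x})\, I_y(\mathbf{x}) \\ \sum W^2(\mathbf{x})\, I_x(\mathbf{x})\, I_y(\mathbf{x}) & \sum W^2(\mathbf{x})\, I_y^2(\mathbf{x}) \end{pmatrix}$$

Here we have introduced the abbreviations $\Delta I(\mathbf{x}) = I(\mathbf{x}) - E(\mathbf{x})$
for the difference of the intensity values and
$I_x = \partial I / \partial x$ for the partial derivative, for $I_y$ respectively.
All sums run over $\mathbf{x} \in \Omega$. Indication of the explicit
dependence of $\mathbf{D}$, $\mathbf{d}$, and $\mathbf{c}$ on $\mathbf{x}$ is omitted.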

In our implementation we use a five tap central difference
mask to approximate these spatial derivatives, i.e.,
the coefficients are $\frac{1}{12}(-1, 8, 0, -8, 1)$. This is a reasonable
compromise between computational cost and goodness
of the approximation provided that the signal is
sufficiently bandlimited. The spatial neighborhood $\Omega$ is
a 5 x 5 square centered around the actual image point.
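
The local estimate can be sketched in NumPy as follows. The function names are hypothetical, the weights default to a uniform window (an assumption of this sketch), and `np.roll` is used for brevity, so derivative values within two pixels of the image border are not valid. The test pattern is synthetic, with a known sub-pixel shift.

```python
import numpy as np

def derivative(image, axis):
    """Five-tap central difference (1/12)(-1, 8, 0, -8, 1) along `axis`.
    np.roll wraps around, so the two outermost rows/columns are invalid."""
    return (-np.roll(image, -2, axis) + 8.0 * np.roll(image, -1, axis)
            - 8.0 * np.roll(image, 1, axis) + np.roll(image, 2, axis)) / 12.0

def local_displacement(I, E, row, col, half=2, weights=None):
    """Solve D d = c over the (2*half+1)^2 window centered at (row, col).
    Returns d = (d_x, d_y) and the 2 x 2 matrix D."""
    Ix = derivative(I, axis=1)           # x runs along the columns
    Iy = derivative(I, axis=0)           # y runs along the rows
    dI = I - E                           # intensity difference
    win = (slice(row - half, row + half + 1),
           slice(col - half, col + half + 1))
    size = 2 * half + 1
    W2 = np.ones((size, size)) if weights is None else np.asarray(weights) ** 2
    gx, gy, diff = Ix[win], Iy[win], dI[win]
    D = np.array([[np.sum(W2 * gx * gx), np.sum(W2 * gx * gy)],
                  [np.sum(W2 * gx * gy), np.sum(W2 * gy * gy)]])
    c = np.array([np.sum(W2 * gx * diff), np.sum(W2 * gy * diff)])
    return np.linalg.solve(D, c), D

# Synthetic example: I is E shifted by d = (0.3, -0.2) pixels.
r, c = np.mgrid[0:33, 0:33].astype(float)
E = np.cos(0.7 * (c - 16.0)) + np.cos(0.5 * (r - 16.0))
I = np.cos(0.7 * (c - 16.0 + 0.3)) + np.cos(0.5 * (r - 16.0 - 0.2))
d, D = local_displacement(I, E, row=16, col=16)
print(d)
```

For this smooth pattern the estimate agrees with the true shift to within a few hundredths of a pixel; larger shifts violate the linearization and require the coarse-to-fine scheme.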

The important fact is that, in addition to the estimated
local displacement $\mathbf{d}(\mathbf{x})$, we can obtain an associated
reliable measure of its correctness. This confidence
measure $k(\mathbf{x})$, as it will be called in the sequel, is used in
the second step to weight each displacement vector when
we fit the parameters of a low order polynomial model
for the global transformation.

In  the  rest  of  this  section  two  questions  will  be  dis¬ 
cussed:  a)  what  are  the  solutions  of  (2),  and  b)  what  is 
the  optimal  confidence  measure  for  our  purposes.  Some 
of  the  following  items  have  been  addressed  in  individ¬ 
ual  papers  on  optical  flow  techniques,  but  we  think  it  is 
worthwhile  to  repeat  them  in  the  context  of  our  specific 
application. 

Note that the matrix $\mathbf{D}$ has two important properties
that will be exploited in the sequel. Firstly, it is symmetric,
i.e., $\mathbf{D} = \mathbf{D}^T$. Therefore, $\mathbf{D}$ has two real eigenvalues
$\lambda_1, \lambda_2 \in \mathbb{R}$ and the corresponding eigenvectors are
orthogonal if the eigenvalues are distinct, i.e., $\lambda_1 \neq \lambda_2$.
Secondly, $\mathbf{D}$ is positive semi-definite (the quadratic form
$\mathbf{x}^T \mathbf{D}\, \mathbf{x} \geq 0 \;\; \forall\, \mathbf{x} \in \mathbb{R}^2$ and $\mathbf{x} \neq 0$) as can be verified by
Sylvester's criterion [35]. Consequently the eigenvalues
are nonnegative ($\lambda_1, \lambda_2 \geq 0$). The eigenvalues are computed
as the roots of the characteristic quadratic polynomial
in our implementation. Let $\lambda_{\min} = \min(\lambda_1, \lambda_2)$
be the smaller eigenvalue, and $\lambda_{\max} = \max(\lambda_1, \lambda_2)$ be
the larger one.

In order to solve (2) for the displacement $\mathbf{d}$ three different
cases have to be distinguished:

1. If $\det(\mathbf{D}) \neq 0$ the inverse exists and (2) can
be solved for the 2-D displacement $\mathbf{d}$. However, in
practice the determinant has to exceed a certain
threshold to ensure stable results: $\det(\mathbf{D}) > \tau_{\det}$.
If the matrix $\mathbf{D}$ is singular, i.e. $\det(\mathbf{D}) = \lambda_{\min} \cdot \lambda_{\max} < \tau_{\det}$,
we must distinguish the following two
cases.

2. If $\lambda_{\max} > 0 \wedge \lambda_{\min} = 0$, i.e., $\lambda_{\max} \geq \tau_{\max} \wedge \lambda_{\min} < \tau_{\min}$
in our implementation, we
have

$$\det(\mathbf{D}) = \sum W^2(\mathbf{x})\, I_x^2(\mathbf{x}) \sum W^2(\mathbf{x})\, I_y^2(\mathbf{x}) - \Big( \sum W^2(\mathbf{x})\, I_x(\mathbf{x})\, I_y(\mathbf{x}) \Big)^2 = 0.$$

This is satisfied if $I_x(\mathbf{x}) = \mathrm{const} \cdot I_y(\mathbf{x}) \;\; \forall\, \mathbf{x} \in \Omega$.
The interpretation is that the image intensities
within the region $\Omega$ lie on a plane and consequently
all spatial gradients have the same direction. This
situation represents the well-known aperture problem
and we can only determine the normal component
of the displacement:

$$\mathbf{d}_n(\mathbf{x}) = \frac{\Delta I(\mathbf{x})}{\|\nabla I(\mathbf{x})\|} \, \frac{\nabla I(\mathbf{x})}{\|\nabla I(\mathbf{x})\|}.$$

3. If $\lambda_{\max} = 0$ all the entries in $\mathbf{D}$ are zero. The situation
$\lambda_{\max} < \tau_{\max}$ may occur in practice if the
image has insufficient texture within the region
$\Omega$. Consequently, the spatial gradients nearly
vanish and we cannot determine any component of
displacement.

The confidence measure $k(\mathbf{x})$ associated with the displacement
vector $\mathbf{d}(\mathbf{x})$ can be derived from the entries
in the matrix $\mathbf{D}(\mathbf{x})$. Several ways to do this have been
proposed in the literature:

1. Simoncelli, Adelson & Heeger presented a Bayesian
framework for optical flow computation [55]. They
emphasized the relevance of the trace of the spatial
derivative matrix for the probability distributions
of velocity vectors. Here, we have for the trace of
$\mathbf{D}$:

$$\mathrm{tr}(\mathbf{D}) = \lambda_1 + \lambda_2 = \sum W^2(\mathbf{x})\, I_x^2(\mathbf{x}) + \sum W^2(\mathbf{x})\, I_y^2(\mathbf{x})$$

2. It is obvious from the previous discussion that the
larger $\det(\mathbf{D})$ is, the more stable is the solution of
the linear system (2).

3. Uras et al. [63] proposed the smallest condition
number‡ $\kappa(\mathbf{H})$ of the matrix $\mathbf{H}$ as an accuracy
criterion. The matrix $\mathbf{H}$ is the Hessian of image
intensity $I(\mathbf{x}, t)$. It arises from an optical flow technique
using second order constraints to recover the
2-D velocity locally (see also [29]).

Based on this approach Toelg developed a refined
and robust algorithm that is used in an active vision
system [60, 59]. However, in extensive experiments
the magnitude of the determinant $\det(\mathbf{H})$,
i.e., the spatial Gaussian curvature in the intensity
image, turned out to be a better confidence measure.
This finding is in accordance with the more
recent results discussed in [7].

Here, we may use the condition number of the matrix
$\mathbf{D}$: $\kappa(\mathbf{D}) = \lambda_{\max} / \lambda_{\min}$ as a measurement for
reliability.

4. Here, we advocate the magnitude of the smallest
eigenvalue $\lambda_{\min} = \min(\lambda_1, \lambda_2)$ as an appropriate
confidence measure. This is along the lines of the
practical results reported in [7].

‡ The condition number is defined as the ratio between
the largest and the smallest absolute eigenvalue of a matrix
(cf. [49]). A matrix is ill-conditioned if its condition number
$\kappa$ is too large, and it is singular if $\kappa$ is infinite.

A brief justification for this choice will be given. We
are only interested in the first case for solving (2) where
the full 2-D displacement vector can be recovered reliably.
Using $\lambda_{\min}$ as a confidence measure gives a lower
bound for the determinant of $\mathbf{D}$, since $\det(\mathbf{D}) \geq \lambda_{\min}^2$.
Of course this also gives a lower bound for the trace of $\mathbf{D}$,
since $\mathrm{tr}(\mathbf{D}) \geq 2\lambda_{\min}$. Moreover, a larger $\lambda_{\min}$ gives rise
to a condition number $\kappa(\mathbf{D})$ closer to unity in the implementation.
This is because $\lambda_{\max}$ is bounded from above
due to spatial lowpass filtering and due to the limited
range of intensity values. It is interesting to note that
for any 2 x 2 matrix, the characteristic equation can be
written as: $\lambda^2 - \lambda\, \mathrm{tr}(\mathbf{D}) + \det(\mathbf{D}) = 0$. We conclude that
taking the magnitude of the smallest eigenvalue $\lambda_{\min}$ as
a confidence measure implies all the other discussed criteria.
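
The eigenvalue computation from the characteristic polynomial, the three-way case distinction, and the two lower bounds can be sketched as follows. The function names and the threshold values in the usage lines are hypothetical; suitable thresholds depend on the image data. In this sketch any configuration that fits neither case 1 nor case 2 is treated as unreliable.

```python
import math

def eigenvalues_2x2(D):
    """Roots of lambda^2 - lambda*tr(D) + det(D) = 0 for a symmetric 2 x 2 matrix.
    Returns (lambda_min, lambda_max)."""
    (a, b), (_, d) = D
    trace = a + d
    det = a * d - b * b
    root = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    return trace / 2.0 - root, trace / 2.0 + root

def classify(D, tau_det, tau_max, tau_min):
    """Case distinction for solving D d = c (thresholds are assumptions)."""
    lam_min, lam_max = eigenvalues_2x2(D)
    if lam_min * lam_max > tau_det:
        return "full"      # case 1: full 2-D displacement
    if lam_max >= tau_max and lam_min < tau_min:
        return "normal"    # case 2: aperture problem, normal component only
    return "none"          # case 3 (or otherwise unreliable): no estimate

D = [[6.0, 2.0], [2.0, 3.0]]
lam_min, lam_max = eigenvalues_2x2(D)     # confidence measure k = lam_min
det = D[0][0] * D[1][1] - D[0][1] * D[1][0]
trace = D[0][0] + D[1][1]
print(lam_min, lam_max, det >= lam_min ** 2, trace >= 2.0 * lam_min)
print(classify(D, tau_det=1.0, tau_max=0.5, tau_min=0.1))
```

For the example matrix the eigenvalues are 2 and 7, so the confidence value would be 2, with $\det(\mathbf{D}) = 14 \geq 4$ and $\mathrm{tr}(\mathbf{D}) = 9 \geq 4$ as the bounds require.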

4.1.3  Fitting  global  parameters  for  the  pose 
model 


We will now derive the second step of the pose estimation
algorithm. The local displacement vectors $\mathbf{d}(\mathbf{x})$ and
their associated confidence measures $k(\mathbf{x})$ are used to
estimate parameters for the global pose transformation
model.

We assume an affine model for the displacement vector
field. This is a reasonable assumption, since we estimate
the pose using low spatial frequency images of
the face only. We want to discard facial expressions that
have very little effect on these low resolution images as
discussed in Section 3.3. We will outline only the basic
idea in this section. The reader is referred to Appendix
A for mathematical details.

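The estimated affine displacement field $\hat{\mathbf{d}}(\mathbf{x}_i)$ is determined
at any image location $\mathbf{x}_i = (x_i, y_i)^T$ by six model
parameters in the general case:

$$\hat{\mathbf{d}}(\mathbf{x}_i) = \mathbf{A}\,\mathbf{x}_i + \mathbf{t} \qquad (3)$$

with

$$\mathbf{A} = \begin{pmatrix} b_x & c_x \\ b_y & c_y \end{pmatrix} \quad \text{and} \quad \mathbf{t} = \begin{pmatrix} a_x \\ a_y \end{pmatrix} \qquad (4)$$

Suppose we have $n$ image points $\mathbf{x}_i$ ($i = 1, \ldots, n$) with a
measured displacement $\mathbf{d}(\mathbf{x}_i) = (d_x(\mathbf{x}_i), d_y(\mathbf{x}_i))^T$,
and associated confidence measures $k(\mathbf{x}_i)$.

In general (3) cannot be satisfied exactly for all points.
Instead, we want to find the parameter vector
$\mathbf{p} = (a_x, b_x, c_x, a_y, b_y, c_y)^T$ (cf. equation (14) on page 20) that
minimizes the error between the measured displacement
field $\mathbf{d}(\mathbf{x}_i)$ and the fitted affine displacement field $\hat{\mathbf{d}}(\mathbf{x}_i)$.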

We assume Gaussian statistics of the process and use the
$L_2$-norm as a distance measure. So, we want to minimize
the sum of the squared differences (SSD) over all image
points:

$$\sum_{i=1}^{n} w_i \, \| \mathbf{d}(\mathbf{x}_i) - \hat{\mathbf{d}}(\mathbf{x}_i) \|^2 , \qquad (5)$$

where we assume a weight $w_i$ for each data point $\mathbf{x}_i$.
These weights are computed as the values of a monotonic
function $w_i = s(k(\mathbf{x}_i))$ of the associated confidence measures.
The function $s(\cdot)$ must be nonlinear to decrease
the range, which turned out to be too large. We utilize
a sigmoid-like characteristic. In our experiments the
choice of $w_i = \sqrt{k(\mathbf{x}_i)}$ worked very well.

The solution for this weighted least squares problem
is found by the weighted pseudo-inverse as given in (18).
In the general case this leads to six equations for six
parameters as given in (19) and (20).

The affine model cannot account for perspective effects
and occlusions such as occur during off-plane rotation
of the face. This kind of head movement must
be handled by separate example images. Therefore, we
do not want to allow for any component of shear and reduce
the degree of freedom in the affine model. To admit
only translation, scale and in-plane rotation in the model
we impose additional constraints on the transformation
matrix (see Appendix A.2 for details):

$$\mathbf{A} = \mathbf{S}\mathbf{R} - \mathbf{I} \qquad (6)$$

with

$$\mathbf{S} = \begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix}, \quad \mathbf{R} = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}. \qquad (7)$$

$\mathbf{I}$ is the identity matrix, $s$ denotes the isotropic scale factor,
and $\alpha$ is the angle of rotation in the image plane.
These constraints result in a coupling between the equations
for the general case. The system can be reduced to
four equations (see (26) and (27)) in the four model parameters
$a_x, a_y, s, \alpha$ (see (25) on page 21 for definition).

Further simplification can be achieved by introducing
a new barycentric coordinate system (see Appendix A.4).
The origin of this new reference frame coincides with the
weighted center of gravity of the image. Expressed in this
new reference frame the equation systems take a much
simpler form.

For the sake of generality, the case of a full affine
transformation is derived first. The new equations for
the general case with six free parameters are given in
(37) and (38) on page 22. They can be directly solved
for the translation parameters $a_x, a_y$. For the remaining
parameters only two decoupled 2 x 2 systems have to be
solved.

However, even more significant is the advantage of the
new reference system in the constrained case where we
do not allow for any component of shear (see Appendix
A.4.2). We obtain a 4 x 4 equation system. As is
obvious from (42) the corresponding matrix is diagonal.
Hence, the system can be solved directly and the solution
can be written in closed form (see (44) - (46)).

The size of the face images varies because of changes in
the distance between the camera and the person's head.
We model these variations by a scale factor $s$ in the image
plane. Mathematically this is only correct for a size
variation due to a change in the focal length of the lens
while the distance is retained (no change of perspective).
Since we apply this algorithm only to low spatial frequency
images, the influence of perspective distortions
and self-occlusion is mostly negligible. Our experiments
demonstrated that this approximation is sufficient for a
reasonable distance range (about ±20% around the ideal
position) and for typical ratios of the focal length and
the viewing distance.
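
The constrained fit of this section can be sketched in NumPy as follows. The closed form below is our own derivation from the model (6), (7) under the convention $a = s\cos\alpha - 1$, $b = s\sin\alpha$; it is meant to illustrate why the system becomes diagonal in weighted-centroid coordinates and is not a transcription of the appendix equations. The function name and argument layout are hypothetical.

```python
import numpy as np

def fit_restricted_affine(points, displacements, weights):
    """Weighted least-squares fit of d(x) = (sR - I) x + t without shear.

    points, displacements: arrays of shape (n, 2) holding (x, y) and (d_x, d_y);
    weights: array of shape (n,), e.g. the square root of the confidence values.
    Coordinates are taken relative to the weighted centroid, which decouples
    the four unknowns. Returns (s, alpha, t, centroid), where t is the
    translation expressed in the centroid-based frame.
    """
    w = np.asarray(weights, dtype=float)
    pts = np.asarray(points, dtype=float)
    d = np.asarray(displacements, dtype=float)
    centroid = (w[:, None] * pts).sum(axis=0) / w.sum()
    x, y = (pts - centroid).T
    dx, dy = d.T
    t = (w[:, None] * d).sum(axis=0) / w.sum()      # translation
    norm = np.sum(w * (x * x + y * y))
    a = np.sum(w * (x * dx + y * dy)) / norm        # a = s*cos(alpha) - 1
    b = np.sum(w * (y * dx - x * dy)) / norm        # b = s*sin(alpha)
    s = float(np.hypot(a + 1.0, b))
    alpha = float(np.arctan2(b, a + 1.0))
    return s, alpha, t, centroid

# Synthetic check: displacements generated from known parameters.
rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 64.0, size=(50, 2))
w = rng.uniform(0.1, 1.0, size=50)
s_true, alpha_true, t_true = 1.1, 0.05, np.array([2.0, -1.0])
R = np.array([[np.cos(alpha_true), np.sin(alpha_true)],
              [-np.sin(alpha_true), np.cos(alpha_true)]])
A = s_true * R - np.eye(2)
center = (w[:, None] * pts).sum(axis=0) / w.sum()
d_field = (pts - center) @ A.T + t_true
s, alpha, t, centroid = fit_restricted_affine(pts, d_field, w)
print(s, alpha, t)
```

Because each unknown is obtained from a single ratio of weighted sums, no matrix inversion is needed, which is the practical benefit of the diagonal system.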

4.1.4  Hierarchical  control  structure 

The pose estimation algorithm is embedded in a
coarse-to-fine control structure. Computation starts at the
highest level N within the Laplacian pyramid (cf.
Section 4.1.1). At this level, which has the lowest resolution,
all model parameters are initialized. On each pyramid level
the following steps are performed in sequence:

1. The local displacement vector field d(x) and the
associated confidence field k(x) are computed in
the way described in Section 4.1.2.

2. The global transformation parameters of a
constrained affine model are estimated by the weighted
least squares fit derived in Section 4.1.3 and the
appendix.

3. The residual affine parameters estimated at level l
and the parameters propagated from the higher level
l + 1 are combined. As is shown in Appendix A.8,
the refined affine parameters at level l are simply
obtained as the sum of the propagated parameters
and the residual parameters.

4. The estimated model parameters are then
propagated from a higher pyramid level l + 1 to the next
lower level l. The relation between the parameters
is derived in Appendix A.7. The result is:

$$A^{l} = A^{l+1} \quad\text{and}\quad \mathbf{t}^{l} = 2\,\mathbf{t}^{l+1} \qquad (8)$$

Thus, only the translation vector has to be
multiplied by two. The coefficients of the matrix
$A^{l+1}$ are retained.

5. At the current level l the original image is
warped according to the propagated model
parameters. The warp operation remaps the intensity
values according to:

$$I_{\mathrm{warped}}(\mathbf{x}, l) = I_{\mathrm{orig}}(\mathbf{x} + \mathbf{d}(\mathbf{x}), l)$$

Note that the addresses x + d(x) in I_orig will in
general not coincide with integer coordinates of the
image array. The intensity values in I_warped have
to be interpolated over a small region centered at
this address. Only bilinear interpolation is used for
the bandlimited pyramid images. The warped image
I_warped is in closer alignment with the example
image E.

6. Steps 1-5 are repeated on successively lower
pyramid levels, i.e., on successively higher
frequency bands of the original images. The process
is terminated at a given pyramid level. In our
implementation we terminated the refinement at level
2 or 3, i.e., at 1/16 or 1/64 of the original image
size.

7. After the refinement is terminated, the estimated
affine parameters are propagated (see Appendix
A.7) to the bottom level of the pyramid, which has
the same size and resolution as the original image.

8. To obtain a pose compensated face, the new
image I is warped according to the pose parameters.
Here, however, we use bicubic interpolation (e.g.,
Lagrangian interpolation [11, 5] or bicubic splines)
for the warping, since we want to avoid suppressing
the high frequency components of the original image
and to reduce aliasing effects.
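
The bookkeeping of steps 3, 4 and 7 can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments. It assumes that the affine model is written as a displacement field d(x) = A x + t, so that parameters can be added as stated in step 3. The callback `estimate_residual` is a hypothetical stand-in for steps 1, 2 and 5 (displacement computation, least squares fit, warping).

```python
def propagate(A, t):
    """Move parameters from level l+1 to the next finer level l (Eq. 8).

    The matrix coefficients are retained; only the translation doubles.
    """
    return [row[:] for row in A], [2.0 * t[0], 2.0 * t[1]]


def combine(A_prop, t_prop, A_res, t_res):
    """Refined parameters = propagated parameters + residual parameters."""
    A = [[A_prop[r][c] + A_res[r][c] for c in range(2)] for r in range(2)]
    t = [t_prop[0] + t_res[0], t_prop[1] + t_res[1]]
    return A, t


def coarse_to_fine(top_level, final_level, estimate_residual):
    """Refine from top_level down to final_level, then propagate the
    result to level 0, i.e., to the resolution of the original image."""
    A, t = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
    for level in range(top_level, final_level - 1, -1):
        # Hypothetical callback: returns the residual (A, t) measured at
        # this level after warping with the current parameters.
        A_res, t_res = estimate_residual(level, A, t)
        A, t = combine(A, t, A_res, t_res)
        if level > final_level:
            A, t = propagate(A, t)
    for _ in range(final_level):
        A, t = propagate(A, t)
    return A, t
```

With this parameterization a point x at level l + 1 corresponds to 2x at level l, and the propagated field satisfies d(2x) = 2 (A x + t), which is why only the translation is scaled.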

4.1.5  Experimental  results 

Using real image data, we now present some typical
experimental results to demonstrate the robustness
and versatility of the algorithm for pose estimation and
compensation.

The two images in Figure 5 might typically occur during
a video-conferencing session. Here the left image is
a reference image that would be stored in the example
database of standardized face images. The right image
represents a new video frame acquired during the session.
Its pose parameters have to be estimated with respect to
the reference image, and the new image has to be
standardized in pose to facilitate further processing. Figure
6 depicts the resulting images after automatic pose
compensation. The estimation is only continued up to a final
pyramid level. Subsequently, the parameters are
propagated to the resolution of the original image. After
estimation on level 3 or 2, no significant change is apparent
when estimating at higher resolution. Table 2 gives the
values of the corresponding pose parameters. The
diagrams in Figure 7 show different graphic representations
of the data in Table 2. The parameters estimated at
higher levels clearly converge to the values obtained at
the bottom level, i.e., at the highest resolution.
Furthermore, the largest adjustments already happen at the
higher levels, which have low resolution.

Figure 8 shows another, more "difficult" pair of
images. In addition to the different poses of the faces, there
is a significant difference in the facial expressions, e.g.,
the smile in the reference image with the mouth opened,
the teeth visible, shifted corners of the mouth, and a
different direction of gaze as compared to the more neutral
right image. In Figure 9 the resulting images after
estimation and pose compensation up to the final resolution
level are displayed. By visual inspection, no significant
change happens after level 2. The corresponding
parameter values in Table 4 as well as the diagrams in Figure
10 confirm this finding.

However, it is notable that the curves are not
monotonic and not as smooth as in Figure 7. This suggests
some caution and a closer examination. Figure 11 shows
the pose parameters obtained at each resolution level.
The four diagrams depict the results after an increasing
number of iterations (1, 2, 3, 5, respectively) at each level
before proceeding to the estimation at the next higher
resolution level. Table 5 gives the corresponding
parameter values. The largest change occurs between
one and two iterations per level. This observation
indicates that one pass per level might not be sufficient
to achieve the best possible alignment between both
images, especially when dealing with more "difficult"
images as in Figure 8. For visual assessment, Figure 12
shows the pose compensated images for different
numbers of iterations. Although the numerical parameter
values for one and for a larger number of iterations differ
(especially for vertical translation and in-plane rotation),
the visual appearance is quite similar. A comparison
with Table 3, which represents the data for
the more similar (in facial expression, not in pose) image
pair in Figure 5, is instructive. Here, the parameter
variation for an increasing number of iterations per level
is much smaller.

To be on the safe side, two or three iterations per
level should be used to improve the estimation and
compensation. Although the difference might not be
noticeable by visual inspection, it will be beneficial for
further processing steps.

To demonstrate the range of variation the algorithm
can cope with, Figure 15 gives images that have been
aligned with various reference images. Only two
examples with strong facial expression for the reference images
are reproduced here. The reference images used in
Figures 16 and 17 are similar to the faces in the lower row of
Figure 3, which are likely to overtax most feature based
algorithms. However, the results are quite convincing,
despite the fact that only one pass (number of
iterations = 1) has been performed and the final estimation
level is l = 2. The range in pose is mainly limited by
border effects of the filtering within the Laplacian
pyramid. The range would be extended provided that the
face is smaller with respect to the image size. Notice
that the pose estimation algorithm is symmetrical with
respect to both images, i.e., it does not make a difference
whether the new image or the reference image exhibits a
strong facial expression. Moreover, the algorithm works
well with strong expressions and deformations in both
images. More examples with a variety of different face
images and also different backgrounds are presented in
[58].

4.2 Discussion of pose estimation

The pose estimation and normalization algorithm
described in this section can be located at an
intermediate level of complexity among models describing global
motions of a face in images. The idea is to make some
general assumptions about the object that give rise to
a parameterized model. This model will be valid, i.e., a
good enough approximation for our purpose, only for a
limited range of poses and transformations. Some other
linear models in increasing order of complexity (f, n),
i.e., number of free parameters f and minimum number
of corresponding image points n required, are:

1. pure translation in the image plane (2,1)

2. translation and rotation in the image plane (3,2)

3. constrained affine transformation (4,2) - the model
used in the algorithm

4. full affine transformation in the image plane (6,3)
- correct for motion of a plane under orthographic
projection

5. projective transformation of a plane (8,4) - correct
for motion of a plane under perspective projection
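
The pairs (f, n) in this list are consistent with a simple counting argument, given here as a short aside and assuming points in general position. Each point correspondence supplies two scalar equations, one per image coordinate, so a model with f free parameters needs at least

$$n = \left\lceil \frac{f}{2} \right\rceil$$

corresponding points. For f = 2, 3, 4, 6, 8 this gives n = 1, 2, 2, 3, 4, as listed.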

Rotation in the image plane fully compensates for
rotation of the head in space about an axis parallel to
the optical axis of the camera (in-plane rotation).
Although not exactly correct in the mathematical sense,
in practice translation in the image plane compensates
for translation of the head in space parallel to the
image plane. In the usual situation for video-conferencing,
however, a person will fixate the video display and hence
will keep the direction of gaze directed toward the nearby
camera. Thus, shifting the head in space will most
likely be accompanied by a compensatory rotation of
the head. The change of the image plane scale factor
can, to a reasonable approximation, take care of
distance variations between head and camera; the
requirements of weak perspective^4 are sufficiently well
satisfied by standard imaging geometry. However,
numerous experiments showed that more complex
transformations do not yield natural appearing face images and
are very difficult or even inaccessible for subsequent
processing steps, e.g., finding nearest neighbors. For
instance, when allowing full affine transformation, the additional
components of shear can lead to severe distortions of a
face. Similarly, for points distant from the plane
defining a projective transformation, severe distortions occur.
Moreover, occlusion effects become very obvious even for
small angles of off-plane rotation of a head (e.g., rotation
around the neck). For these reasons off-plane rotations
are better treated by separate example images. Taking
these observations into account, the constrained affine
transformation in the image plane is a good compromise
between the achieved reduction in the necessary number
of example images and the image fidelity.

A further step is to use more prior structural
information (than the approximation by a plane) about
faces/heads in order to increase the range of
transformations resulting in natural looking images.
Recently, an algorithm has been presented that applies the
model of a general quadric surface to map two images of
faces taken under perspective projection onto each other
[53, 54]. Using a projective framework, a constructive
proof is given that in general nine corresponding
reference points and the epipoles in two views of an
approximately quadric surface determine the correspondences
for all other image points of that surface. Encouraging
results have been achieved for real face images. This
algorithm, however, still requires manual selection of the
reference points. Also, its robustness does not meet the
high standards of the simpler algorithm presented in this
paper. On the other hand, it has been demonstrated
that the quadric surface model can successfully deal with
small amounts of off-plane rotation of the head as long
as occlusion effects are not too severe.

Although the images used here have a fairly
homogeneous background (some structure from the fabric and
due to illumination), the pose normalization algorithm
would still work in the presence of a moderate amount
of textured background surrounding the face. More
robustness could be achieved by a preceding segmentation
step. Suitable techniques for segmentation of the face
from the static background are readily available, such as
color segmentation [2, 4] and integral projection of
directional information [15]. These techniques perform
grey-level based static segmentation of single images, whereas
motion segmentation in front of the static background
exploits relative motion due to the unavoidable small
jitter or movements of a person's head (a robust
motion segmentation algorithm is described in [60, 58], for
instance). Combinations of these techniques should be
considered in order to achieve greater generality.

^4 For weak perspective, the depth of objects along the line
of sight is small compared with the viewing distance.


4.3  Finding  the  nearest  neighbors 

In order to find the nearest neighbor(s) in an example
database for the subimages extracted from incoming face
images, several approaches are conceivable. The
important issue here is the similarity measure used for this
assessment. We will describe only the two ways to find
nearest neighbors that we have implemented.


4.3.1  Template  matching  in  a  multiresolution 
hierarchy 

One way to assess the similarity between images or
subimages is based on template matching by means of the
normalized linear cross-correlation coefficient:

$$C_N = \frac{\langle (E - \langle E \rangle)(I - \langle I \rangle) \rangle}{\sigma(E)\,\sigma(I)} = \frac{\langle E I \rangle - \langle E \rangle \langle I \rangle}{\sigma(E)\,\sigma(I)}$$

where E is an example image and I is a new image that
is already normalized in pose. <.> denotes the average
operator and σ(.) the standard deviation over the image
intensity values. The value range is C_N ∈ [-1.0, 1.0].
If E and I are identical, we have complete positive
correlation, C_N = 1.0. If C_N ≈ 0.0, then the images are
uncorrelated. The use of this standard technique is
suggested by the good results reported for face recognition
by other researchers (cf. [6, 13, 28]).
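
As a small illustration, the coefficient can be computed directly from its definition. The sketch below assumes that both images are given as flattened sequences of intensity values of equal length; the function name and the convention of returning 0.0 for a constant image (where the coefficient is undefined) are choices made for this example only.

```python
import math


def normalized_correlation(E, I):
    """Normalized linear cross-correlation of two equally sized images.

    E and I are flat sequences of intensity values.
    """
    n = len(E)
    if n == 0 or n != len(I):
        raise ValueError("images must be non-empty and of equal size")
    mean_e = sum(E) / n
    mean_i = sum(I) / n
    # <E I> - <E><I>
    cov = sum(e * i for e, i in zip(E, I)) / n - mean_e * mean_i
    var_e = sum(e * e for e in E) / n - mean_e * mean_e
    var_i = sum(i * i for i in I) / n - mean_i * mean_i
    if var_e <= 0.0 or var_i <= 0.0:
        return 0.0  # constant image: coefficient undefined, report 0
    return cov / math.sqrt(var_e * var_i)
```

Because the mean is subtracted and the result is divided by both standard deviations, the measure is insensitive to a global change of brightness offset and contrast gain between the two images.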

Our implementation performs normalized linear
cross-correlation within a multiresolution hierarchy.
Correlation is computed between corresponding levels of
Laplacian pyramids for the new images and the examples for
the subregion. Computation starts at a high pyramid
level at low resolution. Cross-correlation is performed
for a small window of horizontal and vertical shifts. For
each example image the optimal correlation value within
the shift window is chosen to achieve more robustness
against small distortions. The location of this optimal
correlation is propagated to the next lower pyramid level
and defines the center of the shift window at the next
higher resolution. The size of the shift window may
either be constant for all pyramid levels or may increase
with the spatial resolution.
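
The shift search can be sketched as follows. This is an illustrative outline under several assumptions: the pyramids are already computed and passed in as lists of 2-D arrays (index 0 is the finest level, each coarser level has half the width and height), the shift window has a constant radius on all levels, and the correlation is evaluated only over the overlapping area. The names `ncc`, `overlap_pairs` and `hierarchical_match` are introduced for this example.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two flat, equally long lists."""
    n = len(a)
    if n < 2:
        return 0.0
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum(x * y for x, y in zip(a, b)) / n - ma * mb
    va = sum(x * x for x in a) / n - ma * ma
    vb = sum(y * y for y in b) / n - mb * mb
    if va <= 0.0 or vb <= 0.0:
        return 0.0
    return cov / math.sqrt(va * vb)


def overlap_pairs(example, image, dx, dy):
    """Pair example(x, y) with image(x + dx, y + dy) where both exist."""
    h, w = len(example), len(example[0])
    es, ims = [], []
    for y in range(h):
        for x in range(w):
            u, v = x + dx, y + dy
            if 0 <= u < w and 0 <= v < h:
                es.append(example[y][x])
                ims.append(image[v][u])
    return es, ims


def hierarchical_match(example_pyr, image_pyr, radius=1):
    """Return (best correlation, (dx, dy)) at the finest level."""
    cx, cy = 0, 0
    best = -2.0
    for level in range(len(example_pyr) - 1, -1, -1):
        best, best_shift = -2.0, (cx, cy)
        for dy in range(cy - radius, cy + radius + 1):
            for dx in range(cx - radius, cx + radius + 1):
                c = ncc(*overlap_pairs(example_pyr[level],
                                       image_pyr[level], dx, dy))
                if c > best:
                    best, best_shift = c, (dx, dy)
        cx, cy = best_shift
        if level > 0:
            # A shift found at a coarse level doubles at the next finer one.
            cx, cy = 2 * cx, 2 * cy
    return best, (cx, cy)
```

To find the nearest neighbor, this search would be run once per example image and the example with the largest returned correlation would be selected.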

The results obtained so far are encouraging.
Experiments have been conducted using about 20 example
images either for the whole face or for the same subregion
around the eye. The new images (taken from a
motion sequence), for which the nearest neighbor had to
be found, were similar to one of the examples, but were
not included in the example database. All face images
were previously normalized in pose using the robust
algorithm presented in Section 4.1. For all test images
the hierarchical template matching algorithm picked as
nearest neighbor the image that appeared most similar
to human observers. Robustness against small residual
shifts between the images is achieved by choosing the
center of the shift window according to the optimum at
the previous level. Instead of using different frequency
bands within a Laplacian pyramid, we also tried
correlation using gradient-magnitude images within a
Gaussian pyramid of the images. This kind of preprocessing
before correlation has been reported to yield superior
recognition results [13]. There was no difference in the
chosen nearest neighbors. However, differences in the
normalized correlation coefficients were less pronounced
for images with added random noise.

4.3.2  Fit  error  of  the  pose  model 

Another approach to find the nearest neighbors is more
closely related to the automatic pose estimation
algorithm described in Section 4.1 and Appendix A. The
general idea is to make use of the displacement vector
fields between a new image and the example images.
The most similar example will be the one having the
smallest sum of vector magnitudes taken over the entire
region. To be more specific, what is used is the
remaining displacement vector field after aligning both images
according to the constrained affine transformation that
takes care of different poses. In other words, the
information used to assess similarity between images is the
sum of squared differences between the measured
displacement field and the fitted affine displacement field
at each pyramid level (see also (5)). If the variation
between two images is only due to different poses as defined
by the affine model, then both displacement fields will be
identical and both images will be assessed to be similar.

Two ways to compute this similarity measure have
been considered. The simplest way is to discard the
weights assigned to each displacement vector
(expressing confidence in the data) and to compute the
homogeneous error of the fit. The formulas for doing so are
derived in Appendix A.5. The second way is to compute
the weighted errors of the fit as derived in Appendix
A.6 for the three different model cases. The error can be
expressed in terms of the estimated model parameters.
For the weighted error, only two sums over
the squares of the measured displacement components
have to be computed in addition to the terms already
computed to estimate the model parameters. This
facilitates an efficient implementation. The summed errors
still have to be normalized to account for different
image sizes or for the individual weights. The criterion
used to assess image similarity is the mean deviation of
the measured displacement field from the fitted affine
displacement field, expressed as the variance (see
Appendices A.5 and A.6 for details).
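
A direct, unoptimized version of this measure can be sketched as follows. The exact formulas are those of Appendices A.5 and A.6; the sketch below only illustrates the idea and assumes that the fitted field has the form d(x) = A x + t and that the normalization is a division by the sum of the weights. It does not use the efficient formulation in terms of the sums already computed for the fit.

```python
def fit_variance(points, measured, A, t, weights=None):
    """Mean squared deviation of measured displacements from the fitted
    affine displacement field.

    points   -- list of (x, y) positions
    measured -- list of measured displacement vectors (dx, dy)
    weights  -- optional confidences; None gives the homogeneous error
    """
    if weights is None:
        weights = [1.0] * len(points)
    total_weight = sum(weights)
    if total_weight <= 0.0:
        raise ValueError("sum of weights must be positive")
    error = 0.0
    for (x, y), (dx, dy), w in zip(points, measured, weights):
        fx = A[0][0] * x + A[0][1] * y + t[0]
        fy = A[1][0] * x + A[1][1] * y + t[1]
        error += w * ((dx - fx) ** 2 + (dy - fy) ** 2)
    return error / total_weight
```

If two images differ only in pose, the measured and fitted fields coincide and the function returns zero; residual deformations such as a change of expression increase the value.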

Some experimental results of this approach will be
discussed now. The last two columns in Table 2 give
the weighted and the homogeneous variance for the
image pair in Figure 5; so does Table 4 for Figure 8. These
data suggest the following generalizations. The variances
are bound between a theoretical lower limit of 0.0 and
an upper limit of about 1.5 that is due to the gradient
technique used to compute displacement vectors. There
are two exceptions. Firstly, the weighted variances at the
highest pyramid level are usually larger than at the next
lower level. This is because the initial pose estimation
is not accurate enough and the variance is dominated
by errors due to misalignment. Secondly, at the
highest resolution the variances are in general smaller than
at the previous level. This phenomenon may have two
explanations. Either the images are very similar and
the alignment is significantly better at highest
resolution, or the images are rather different and the
computation of the displacement vectors fails because the
convergence range of the gradient algorithm is exceeded. This
suggests using only the significant intermediate pyramid
levels to assess image similarity. Indeed, comparing the
data in Tables 2 and 4 shows that the variances for the
significant levels are always smaller for the more similar
images in Figure 5 than for the images in Figure 8, which
exhibit rather different facial expressions.

Previous results (see Section 4.1.5) indicate that even
better alignment can be achieved at a given pyramid
level if more than one iteration of the pose estimation
algorithm is performed. Tables 3 and 5 give the variances
for multiple iterations. The variances tend to decrease
slightly if more iterations per level are done. Although
only the data for the bottom level are given in the Tables,
this result also holds for intermediate levels.

4.4  Reconstruction  of  face  images 
4.4.1  Blending  patches  of  interest  together 
We now consider the problem of reconstructing a
composite face image from a patchwork of example
subregions.

In computer graphics, texture mapping techniques
have been applied to map an image that is a frontal
view of a face onto a 3-D wire frame model of the
surface. Quite realistic animation of facial details such as
eye movements or speech can be achieved by blending
sequences of eye and mouth subimages into a base image at
appropriate positions prior to mapping (see for instance
[24]). Both base and subimage are static frontal views
of the face.

As observed by Duffy [24], simply pasting a
subimage, e.g., a rectangular region of the mouth, yields fairly
unsatisfactory results due to visible discontinuities at
the edges of the pasted area. These discontinuities are
caused by variations in brightness and color between
base and subimages as well as by minor changes in
facial shape (misalignment) occurring for real subjects. To
remedy these shortcomings Duffy proposed a transition
zone located around the bounding box of the rectangular
subimage. Within this transition zone the values of the
composite image are computed by weighted averaging
between corresponding values of base image and
subimage. A weighting function having linear dependence on
position is applied. In this way, significantly better
animation, i.e., fewer spurious effects, can be obtained.
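
The weighted averaging in such a transition zone can be illustrated in one dimension. The sketch below is only meant to make the idea concrete and is not taken from [24]; it assumes a single image row, a transition zone of `width` pixels beginning at column `start`, the base image to the left of the zone and the subimage to its right.

```python
def linear_blend_row(base, sub, start, width):
    """Blend one row of `sub` into `base` with a linear weight ramp.

    Left of the zone the result equals `base`, right of it `sub`;
    inside the zone the weight of `sub` rises linearly with position.
    """
    out = []
    for x, (b, s) in enumerate(zip(base, sub)):
        w = (x - start + 1) / (width + 1)
        w = min(1.0, max(0.0, w))  # clamp outside the transition zone
        out.append((1.0 - w) * b + w * s)
    return out
```

For a base row of zeros and a subimage row of fours with `start=2` and `width=3`, the result is the ramp 0, 0, 1, 2, 3, 4, 4, 4.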


However, this simple way of blending a subimage into
a base image is not flexible enough for our demands. Let
us mention only its most important shortcomings: it
requires a rectangular region of fixed size and location, the
width and location of the transition zone are very
critical in order to achieve realistic results, and there is no
straightforward generalization to many (possibly
overlapping) subimages because of geometrical constraints.

We will now explain a more general algorithm for
blending, i.e., seamlessly merging, several image regions
to form a composite image. The essential requirement is
to preserve important details of the individual source
images without introducing artifacts by the blending
process. Two factors are relevant for choosing the width of
the transition zone. If the transition zone is narrow as
compared to the image features, then the boundary will
still be noticeable in the composite image, although it
will appear blurred. On the other hand, if the transition
zone is too wide, then features from several source images
may appear superimposed, similar to a multi-exposure in
photography. These conflicting requirements cannot be
fulfilled simultaneously in general, i.e., for images
covering a wide range of spatial frequencies. A suitable
transition width can be found only if the spatial frequency
band of the images is relatively narrow.

To overcome this problem Burt & Adelson [18, 19]
proposed a multiresolution approach for merging images.
First, each source image is decomposed into a set
of bandpass filtered component images. In the next
step, the component images are merged separately for
each band to form mosaic images by weighted averaging
within a transition zone. The transition width
corresponds approximately to a half wave length of the band
center frequency. Finally, these bandpass mosaic images
are simply summed to obtain the desired composite
image. Thus, the transition zone always matches the size of
the image features. This technique has been formulated
for pairs of static source images and demonstrated to
yield superior results over simpler techniques in several
applications [18, 19].

We adopt this idea and give a more general
formulation that applies to any finite number of source images
and to time sequences of images. The subregions of face
images will be called patches of interest (POI). As
opposed to rectangular regions of interest (ROI), a POI may
have arbitrary shape. Moreover, several POIs may
overlap and can be arranged in a stack. In this pseudo 3-D
structure only the top patches contribute to the
composite image.

Our blending algorithm takes two kinds of inputs.
Firstly, an indexed set of subimages compatible with the
corresponding POIs. These subregions of facial
example images are previously normalized in pose. These
examples comprise, among others, base images of the
face under various off-plane rotation views and a variety
of subimages of facial details like different mouth shapes
and different states of eye movement and eye blinks.
Secondly, a sequence of index images that describe the
appearance of the composite face image over time; the
index images and the composite image therefore have
the same size. Each pixel value in these index images
refers to the POI that should dominate the composite
image at the corresponding position.

^5 Amnon Shashua provided valuable contributions to our
discussion and the relevant papers.

The  procedure  consists  of  the  following  steps: 

1. For each example subimage associated with the
POI having the index number i generate a
Laplacian pyramid $LE_i$ consisting of bandpass filtered
images $LE_i^l$ using the procedure described in
Section 4.1.1^6. This has to be done only initially and
the pyramids can be stored for fast access. The
storage requirement is only 4/3 of the original
image.

2. For each index image $X_n$ in the time sequence
collect the set of index numbers $\mathcal{N}$ of all referenced
example subimages. Simultaneously, build a binary
mask image $M_i$ for each index number. A pixel is
assigned the value 1 at positions having the
corresponding index value in the index image and 0
everywhere else.

3. Generate a Gaussian pyramid $GM_i$ for each mask
image $M_i$ included in the index set $\mathcal{N}$.

4. The entries in the $GM_i$ pyramid are used as weights
for the corresponding bandpass filtered example
subimages in $LE_i$. For each band (level l of the
pyramid) a mosaic image $LC^l$ is computed in the
following way:

$$LC^{l} = \sum_{i \in \mathcal{N}} GM_i^{l} \cdot LE_i^{l}$$

where the sum is taken only over the examples
included in the index set $\mathcal{N}$. This significantly
increases efficiency if a large number of potential
example subimages is used, as required for realistic
animation.

5. Finally, the procedure of generating the Laplacian
pyramids is reversed to obtain the composite image
C. This is achieved by the following iterative
procedure starting at the highest level N of the mosaic
pyramid LC:

$$GC^{l-1} = LC^{l-1} + \mathrm{expand}\left(GC^{l}\right),$$

where $GC^{N} = LC^{N}$ and $C = GC^{0}$.
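
The five steps can be sketched compactly for one-dimensional signals. This is a simplified illustration rather than the implementation described here: it uses a 3-tap kernel (1, 2, 1)/4 with replicated borders instead of the filters of Section 4.1.1, signals whose length is divisible by 2 to the power of the number of levels, and a single index "image". All function names are chosen for this example.

```python
def reduce(s):
    """Low-pass filter with (1, 2, 1)/4 and subsample by two."""
    n = len(s)
    return [(s[max(2 * i - 1, 0)] + 2.0 * s[2 * i] + s[min(2 * i + 1, n - 1)]) / 4.0
            for i in range(n // 2)]


def expand(s):
    """Upsample by two using linear interpolation."""
    m = len(s)
    out = []
    for i in range(m):
        out.append(s[i])
        out.append((s[i] + s[min(i + 1, m - 1)]) / 2.0)
    return out


def gaussian_pyramid(s, levels):
    pyr = [[float(v) for v in s]]
    for _ in range(levels):
        pyr.append(reduce(pyr[-1]))
    return pyr


def laplacian_pyramid(s, levels):
    g = gaussian_pyramid(s, levels)
    lap = [[a - b for a, b in zip(g[l], expand(g[l + 1]))] for l in range(levels)]
    lap.append(g[levels])  # the top level keeps the low-pass residue
    return lap


def blend(examples, index_image, levels):
    """examples: dict index -> signal; index_image: list of indices."""
    used = sorted(set(index_image))                                   # step 2
    lap = {i: laplacian_pyramid(examples[i], levels) for i in used}   # step 1
    masks = {i: gaussian_pyramid([1.0 if v == i else 0.0 for v in index_image],
                                 levels) for i in used}               # steps 2, 3
    mosaic = []
    for l in range(levels + 1):                                       # step 4
        n = len(lap[used[0]][l])
        mosaic.append([sum(masks[i][l][x] * lap[i][l][x] for i in used)
                       for x in range(n)])
    out = mosaic[levels]                                              # step 5
    for l in range(levels - 1, -1, -1):
        out = [a + b for a, b in zip(mosaic[l], expand(out))]
    return out
```

Two properties are easy to check with this sketch: if a single index covers the whole index image, the corresponding example is reproduced, and if all examples are identical, the composite equals that example regardless of the index image, because the mask weights sum to one on every level.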

Since the POIs associated with each index image may
change between frames, it is desirable to have a smooth
transition between successive frames in an animation.
An adequate way to accomplish this is to apply a
low-pass filter to the binary mask image $M_i$ before
generating the Gaussian pyramid $GM_i$. Weighting the past few
mask images with an exponentially decaying weighting
function can be implemented very efficiently in a
recursive way (see [60] for the algorithm). However, application
of such a filter may require an additional normalization
step of the pixel values in the mosaic images. This is
because in general it cannot be guaranteed that the sum
of all weights for each image location is equal to unity.
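
One way to realize such a temporal filter is a first-order recursive update of the masks, followed by a per-pixel normalization of the weights. The sketch below is an assumption about how this could be done and is not the algorithm of [60]; the smoothing factor `alpha`, the function names, and the choice to normalize the mask weights (rather than the mosaic pixel values) are all introduced for illustration. Masks are flat lists keyed by index number, and `current` must not be empty.

```python
def smooth_masks(previous, current, alpha=0.5):
    """Recursive filter s_n = alpha * m_n + (1 - alpha) * s_(n-1).

    previous -- dict index -> smoothed mask of the preceding frame
    current  -- dict index -> binary mask of the current frame
    Indices missing from one of the dicts are treated as all-zero masks.
    """
    n = len(next(iter(current.values())))
    smoothed = {}
    for i in set(previous) | set(current):
        prev = previous.get(i, [0.0] * n)
        cur = current.get(i, [0.0] * n)
        smoothed[i] = [alpha * c + (1.0 - alpha) * p for c, p in zip(cur, prev)]
    return smoothed


def normalize_weights(masks):
    """Rescale the weights so that they sum to one at every pixel."""
    n = len(next(iter(masks.values())))
    totals = [sum(m[x] for m in masks.values()) for x in range(n)]
    return {i: [m[x] / totals[x] if totals[x] > 0.0 else 0.0 for x in range(n)]
            for i, m in masks.items()}
```

The recursion needs only the smoothed masks of the preceding frame, which corresponds to an exponentially decaying weighting of all past frames.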

The algorithm sketched above has been implemented,
and a very promising, realistic animation of face images
has been obtained. This approach to blending images
could also be successfully combined with texture
mapping techniques.

^6 Indication of the individual pyramid level l is omitted if
procedures are applied homogeneously to all levels.

4.4.2  Recovery  of  original  pose 

The last processing step before displaying the
reconstructed face image is to transform the image
reassembled from normalized examples to the pose of the original
input image. This requires reversing, on the basis of the
transmitted pose parameters, the pose compensation
performed on the sender side. Inverting the
mapping of the image warping (that is, computing the
mapping from image 2 to image 1 if the mapping from 1
to 2 is given) is not trivial in general [64]. However, due
to the parametric model applied here, the parameters for
warping the normalized pose face image to the original
pose can be computed easily from the transmitted pose
parameters describing the alignment with the reference
image.

In  Appendix  B  a  closed-form  solution  is  derived  to 
obtain  the  inverse  mapping  parameters  from  the  original 
parameters.  To  demonstrate  the  inverse  mapping,  in 
Figure  13  the  reference  image  depicted  in  Figure  5  is 
warped  towards  the  pose  of  the  right  image;  this  is  the 
reverse  of  the  pose  compensation. 

Figure  14  summarizes  the  processing  steps  of  a  sim¬ 
plistic  video-conference  system: 

•  normalizing  the  pose  of  a  new  face  image, 

•  finding  the  most  similar  example  out  of  a  database 
of  normalized  images, 

•  reconstructing  the  face  image  given  an  index  num¬ 
ber  (in  the  database)  and  inversion  of  the  pose  nor¬ 
malization. 

4.4.3  Interpolation  between  examples 

In  this  section  we  point  out  possible  extensions  of  the 
system  architecture  outlined  in  Section  4.  Instead  of 
using  the  nearest  neighbors  only,  it  is  natural  to  “inter¬ 
polate”  novel  views  between  examples  from  the  database 
as  already  mentioned  in  Section  3.1.  Extending  previous 
results  [45,  47,  48,  46],  recent  work  of  Beymer,  Shashua 
&  Poggio  [10]  presents  the  mathematical  formulation  and 
experimental  demonstrations  of  several  versions  of  such 
an  approach.  The  feasibility  of  interpolation  between 
images  has  been  successfully  demonstrated  for  the  mul¬ 
tidimensional  interpolation  of  novel  human  face  images. 

So  far,  the  examples  are  selected  manually  for  train¬ 
ing.  For  applications  in  video-conferencing  the  process  of 
picking  adequate  examples  from  the  database  for  subse¬ 
quent  interpolation  obviously  has  to  be  automated;  this 
becomes  an  even  more  significant  issue  for  higher  dimen¬ 
sional  interpolation.  Various  strategies  are  conceivable 
and  we  will  suggest  some  —  still  subject  to  experimental 
evaluation. For 1-D interpolation (morphing between a
pair  of  similar  images)  an  exhaustive  search  for  the  two 
database  images  that  allow  for  the  best  interpolation  re¬ 
sult,  i.e.,  the  result  that  comes  closest  to  the  novel  image, 
appears  reasonable.  But,  for  higher  dimensional  inter¬ 
polation  (at  least  four  examples  are  needed  for  2-D  in¬ 
terpolation)  this  strategy  seems  to  be  prohibitive  due  to 
excessive  computational  costs  (combinatorial  explosion). 
In order to reduce the search space for the interpolation
basis, i.e., the examples that span the space of possible
interpolated views, we suggest the following strategy:
i) Find the nearest neighbor to the novel view in the
database (cf. Section 4.3). ii) Restrict the search space
to images that are within a certain “distance” from the
novel view or the nearest neighbor; the latter is more
efficient since the distances between all examples can be
precomputed. Of course, this presents the problem of
finding a suitable metric to define the distance. Potential
candidates are provided by the algorithms described
in Sections 4.3.1 and 4.3.2. However, other (e.g.,
feature-based) metrics used for recognition should also be
considered. iii) The maximum distance could be chosen
to contain only the number of examples required for
interpolation of a given dimensionality. Note that this
approach may result in ambiguities if more than one
example image has the same distance; this cannot be
excluded in a high-dimensional space. Alternatively, one
can choose the maximal distance so that more than the
required examples are included in the search space.
Subsequently, the number is reduced by abandoning
redundant images, for instance using a leave-one-out strategy
that keeps only the examples providing the best result.
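Steps i) to iii) can be sketched as follows, assuming the distances from the novel view to all examples and the pairwise distances between examples are already available as arrays. The function name and the choice of ranking by distance to the nearest neighbor are illustrative; the metric itself is left open, as in the text.

```python
import numpy as np

def select_basis(dist_to_novel, pairwise, k):
    """Pick k interpolation examples: the nearest neighbor of the novel
    view plus the k-1 database images closest to that neighbor.
    pairwise[i, j] holds the precomputed distance between examples i and j."""
    nn = int(np.argmin(dist_to_novel))             # step i)
    order = np.argsort(pairwise[nn], kind="stable")
    others = [int(j) for j in order if j != nn]    # step ii), ranked
    return [nn] + others[: k - 1]                  # step iii)
```

Ties in distance are resolved here by database order, which is one arbitrary way to handle the ambiguity mentioned above.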

The argument for the strategy of taking the nearest
neighbor as an interpolation basis is not obvious and
requires a better practical understanding of the interpolation
algorithms (see [10]). The algorithm consists of two
major steps. First, correspondence vector fields are
estimated, which capture the geometrical relations between
the novel image and the examples in the best possible
way. In the second step, the texture information is
pixelwise interpolated between the intensity values in the
examples referenced by the correspondence vector fields.
In theory, optimal vector field interpolation for the first
step in general cannot be obtained using just the nearest
neighbors. On the other hand, because of illumination
effects, occlusions and distortions it is likely that the
nearest neighbors contain the most similar texture.

The interpolation technique has been applied to images
of whole faces. A natural extension is to apply
the interpolation technique separately to patches of
interest (POI), as proposed in Section 3.5. By means of
the generalized blending technique described in Section
4.4.1 new views can be composed. We expect a large
potential in combining these two concepts. Here are two
examples: i) The location of the iris and pupil of the eye
(as it changes with the direction of gaze) may be
interpolated from four examples or even from two examples if
we disregard the minor vertical movements. Additional
example sets may be used to account for varying pupil
size. ii) Realistic synthesis of eye blinks may require not
many more than two examples.

We will now suggest a further extension of the
multidimensional interpolation algorithm. So far, the same
coefficients are used for interpolating an approximated
geometric relation (correspondence vector fields) between
the examples and the novel image as well as for the
pixelwise interpolation of the intensity (texture) information
(see Sections 4.1 and 4.2 in [10]). These coefficients are
estimated to yield the “best” possible approximation,
e.g., in the least squares sense, for the correspondence
vector field of the novel image with respect to the
example(s). While this coupling between the coefficients
for geometric and texture interpolation makes sense for
certain applications of 1-D interpolation, e.g., for frame
rate conversion (see [8]), it is not necessarily the best
approach for more general cases.

We suggest exploiting the freedom of adjusting the
coefficients for geometric and texture interpolation
independently. A straightforward way to do this is to
estimate the optimal coefficients for geometric interpolation
first (as before) and subsequently to optimize a second
set of coefficients for texture interpolation. The second
step uses the previously recovered geometric relations to
access the intensity information at corresponding
locations in the example images. The coefficients for the
second step could be found by least squares minimization
applied to the intensity values, for example. While only
doubling the number of parameters, we expect even more
realistic “rendering” of novel images, especially if the
example images are captured under small variations in the
illumination (direction and intensity). A further
amendment could include “virtual” example images that could
partially compensate for small lighting changes. In the
simplest case, the coefficients for a completely black and
a white image could be used to adjust the average
brightness of the synthesized image. More elaborate versions
would use several additional “virtual” examples showing
slowly varying intensity in different directions. Adjusting
the corresponding coefficients will to some extent
simulate changes in illumination direction. Of course,
this is not correct in the strict physical sense, which would
require multiplication of the surface reflection by the
illumination intensity, where both may be functions of the
relative angles. However, for small variations the linear
compensation will give reasonable results at a very low
computational cost.
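The second, texture-only optimization can be sketched as an ordinary least squares problem. The sketch assumes the example images have already been warped into correspondence with the novel image by the geometric step; the function name and the `add_virtual` flag are hypothetical. Only a constant white image is appended as a virtual example, since an all-black (zero) image contributes nothing to a linear combination.

```python
import numpy as np

def texture_coefficients(warped_examples, novel, add_virtual=True):
    """Least squares weights c minimizing ||sum_k c_k E_k - novel||^2.
    warped_examples: array (K, H, W) of examples already brought into
    pixelwise correspondence with the novel image (assumed given)."""
    cols = [e.ravel() for e in warped_examples]
    if add_virtual:
        # Virtual 'white' example: its coefficient shifts average brightness.
        cols.append(np.ones(novel.size))
    basis = np.stack(cols, axis=1)
    coeffs, *_ = np.linalg.lstsq(basis, novel.ravel(), rcond=None)
    return coeffs
```

Virtual examples with slowly varying intensity ramps could be appended in the same way as further columns of the basis.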

Two recent achievements should be mentioned in the
context of generating new views from a small number
of model views or example images. Shashua & Toelg
[53] showed that a nominal quadric transformation for
all image points, i.e., a transformation assuming that an
object surface can be approximated by a quadric, can
be successfully applied to register two face images. All
parameters of the transformation (the quadric and the
relative camera geometry) can be recovered from only
nine corresponding points over two views [53, 54].
Alternatively, the transformation can be recovered using
only four corresponding points and a given conic in one
view (encompassing the face in our application) [54].

This algorithm is relevant here for two purposes.
Firstly, it can be used as a preprocessing step to
facilitate pixelwise correspondence, i.e., bringing two views
into closer alignment. This step is essential for views
that are too different to be directly accessible to standard
dense correspondence algorithms: a small number
of distinct feature points can easily be found in the two
initial views. Secondly, the transformation according to
the quadric surface model is described by a few
parameters (at most 17 are needed). In a video-conference
system only these parameters need to be transmitted in
order to synthesize novel views of a face from one given
example, provided that the views are not too different
(self-occlusion, etc.). The nominal quadric transformation
is significantly more general than simpler
transformations commonly used (e.g., affine, or transformation
due to a plane) and superior registration results have
been obtained with face images (see [53, 54] for
examples).

The second achievement is a generalization of the “linear
combination of views” result obtained by Ullman
& Basri [62] that relates three orthographic views of a
3-D object (ignoring self-occlusion). Recently, Shashua
[52, 51] proved that the image coordinates of corresponding
points over any three perspective views (uncalibrated
pinhole camera) of a 3-D object are related by a pair of
trilinear equations. The 17 independent coefficients of
this trilinear form can be recovered linearly from 9
corresponding points over all three views. Once these
coefficients are recovered and full correspondence between the
two model views has been established, the corresponding
locations in the novel (third) view can be obtained
uniquely for all other points.

The  direct  approach  of  using  the  trilinear  result  to 
generate  new  views  has  several  theoretical  advantages 
over  classical  structure  from  motion  methods  as  well  as 
over  methods  to  recover  non-metric  structure  (see  [51] 
for  a  detailed  discussion).  Moreover,  processes  that  are 
known  to  be  unstable  in  the  presence  of  noise,  such  as 
recovering  the  epipolar  geometry,  are  avoided.  The  tri¬ 
linear  algorithm  proved  to  be  significantly  more  stable 
in  the  presence  of  errors  in  the  image  measurements. 
So far, the trilinear algorithm has been evaluated only
in computer simulations and using re-projection of isolated
points in real imagery, though an implementation
to transform dense images is planned for the near future.

Although in [52, 51] emphasis is given to the task of
recognition of 3-D objects, the trilinear method may
have interesting applications in an example-based
video-conference system. Only 17 parameters are needed to
represent a new view with respect to two model views
(as opposed to only one model view for the nominal
quadric transformation). The two model views, or examples
as we called them earlier, are available on the sender
and the receiver side. The same algorithm for achieving
full correspondence between these reference images is
applied on both sides. To encode a third view, the sender
solves for the 17 parameters by using many points to
increase robustness; this can be done using a least squares
approach. Only the 17 parameters needed for reconstruction
are transmitted, however.

5  Conclusions  and  outlook 

The concept of an alternative approach to
video-conferencing that is sketched in the first part of this
paper appears to be very promising. Several algorithms
have been presented that form modules in a system
architecture. Each of these modules has proved to be
robust under realistic conditions. Much further work
on integration and refinement of the system is required.
Once a more elaborate system based on our approach
is available, it will be interesting to compare its
performance with state-of-the-art systems utilizing traditional
model-based approaches.

Footnote: Amnon Shashua, personal communication, March 1994.

For  an  automatic  video-conference  system,  i.e.,  a  sys¬ 
tem  that  does  not  require  any  human  intervention,  ad¬ 
ditional  components  are  obviously  required.  However, 
many  suitable  algorithms  are  already  known  and  de¬ 
scribed  in  the  literature.  The  most  important  compo¬ 
nents  are  briefly  discussed  in  the  sequel. 

The  separation  of  the  face  and  the  uncovered  back¬ 
ground  can  be  achieved  by  the  methods  sketched  at  the 
end  of  Section  4.2.  Recently  an  algorithm  for  human  face 
segmentation  by  fitting  an  ellipse  to  the  head  has  been 
described  [56].  This  algorithm  is  robust  enough  to  deal 
with  images  having  moderately  cluttered  backgrounds. 

Another,  more  critical  problem  is  the  automatic  selec¬ 
tion  and  positioning  of  the  POIs  in  the  face  image.  This 
task  is  significantly  simplified  by  the  robust  pose  nor¬ 
malization  presented  in  this  paper.  In  the  normalized 
face  images  knowledge  about  the  average  facial  geome¬ 
try  is  easily  applicable  to  define  relevant  regions.  For 
individual  faces,  regions  of  high  surface  structure  can  be 
detected  by  means  of  texture  analysis  techniques  (e.g., 
high  gradient  or  high  spectral  energy).  The  generalized 
blending  algorithm  described  in  Section  4.4.1  to  some 
extent  smooths  out  visible  discontinuities  at  edges  be¬ 
tween  adjacent  POIs.  It  is  desirable,  however,  to  locate 
boundaries  within  regions  of  low  surface  texture  and  not 
at  conspicuous  facial  features  —  loosely  speaking,  we 
apply  a  reversed  edge  detector.  For  this  purpose  ori¬ 
entation  selective  filters  (like  Gabor  filters,  wavelets,  or 
steerable filters [27]) may be the way to go. They make
it possible to search for appropriate locations depending
on  the  orientations  of  boundary  lines  that  are  approxi¬ 
mately  given. 

An  important  step  is  the  automatic  acquisition  of  the 
example  database.  Here  at  least  two  distinct  tasks  have 
to  be  distinguished.  During  the  initialization  phase  ex¬ 
amples  have  to  be  acquired  that  span  the  largest  possible 
range  of  facial  expressions  and  poses.  However,  extreme 
poses  and  expressions  may  be  disregarded  at  this  stage. 
Moreover,  two  cases  have  to  be  considered.  In  the  stan¬ 
dard  case  no  prior  examples  for  a  person  are  available 
when  a  person  uses  the  system  for  the  first  time.  Then 
all  examples  have  to  be  transmitted  initially  using  con¬ 
ventional  image  compression,  e.g.,  JPEG.  In  the  second 
case  prior  examples  are  available  from  previous  sessions. 
New  examples  have  to  be  acquired  on  the  sender  side 
and  it  has  to  be  determined  which  of  the  old  examples 
are  still  compatible  with  the  current  situation.  This  is 
necessary  to  update  changes  in  facial  hair,  for  example. 
The  gain  is  that  usually  only  some  new  examples  may 
have  to  be  transmitted  to  the  receiver  side.  However, 
the  computational  cost  of  the  evaluation  on  the  sender 
side  may  be  higher. 

During the subsequent transmission phase of the
session the task is somewhat different. In general, novel
images should be approximated in terms of the examples
with sufficient accuracy. Precautions should be taken to
detect whenever the best possible reconstruction (based
on the available examples) is not satisfactory. In these
rare cases, e.g., for unusually strong expressions,
(compressed) new image data have to be transmitted. At this
point it is not clear whether this additional image data
for non-standard cases should supplement the standard
database as an additional example: this has to be
subjected to experimental evaluation.
A Hierarchical estimation of global pose from local displacements

We assume an affine model for the displacement vector field. The affine displacement field $\mathbf{d}(\mathbf{x}_i)$ is determined at
any image location $\mathbf{x}_i = (x_i, y_i)^T$ by six model parameters:

$$\mathbf{d}(\mathbf{x}_i) = A\mathbf{x}_i + \mathbf{t} \qquad (11)$$

with

$$A = \begin{pmatrix} b_x & c_x \\ b_y & c_y \end{pmatrix} \quad\text{and}\quad \mathbf{t} = \begin{pmatrix} a_x \\ a_y \end{pmatrix}. \qquad (12)$$

Suppose we have $n$ image points $\mathbf{x}_i \in \mathbb{R}^2$ $(i = 1, \dots, n)$ with a measured displacement $\tilde{\mathbf{d}}(\mathbf{x}_i) = (\tilde d_x(x_i, y_i), \tilde d_y(x_i, y_i))^T$.
We obtain an overdetermined linear system with $2n$ equations and six unknowns that can be written in matrix
form:

$$M\mathbf{p} = \mathbf{d} \qquad (13)$$

with

$$M = \begin{pmatrix}
1 & x_1 & y_1 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & x_n & y_n & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & x_1 & y_1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 1 & x_n & y_n
\end{pmatrix} \in \mathbb{R}^{2n \times 6},
\quad
\mathbf{p} = \begin{pmatrix} a_x \\ b_x \\ c_x \\ a_y \\ b_y \\ c_y \end{pmatrix} \in \mathbb{R}^{6}
\quad\text{and}\quad
\mathbf{d} = \begin{pmatrix} \tilde d_x(x_1, y_1) \\ \vdots \\ \tilde d_x(x_n, y_n) \\ \tilde d_y(x_1, y_1) \\ \vdots \\ \tilde d_y(x_n, y_n) \end{pmatrix} \in \mathbb{R}^{2n}. \qquad (14)$$

In general this system cannot be solved exactly. Instead, we want to find the parameter vector $\mathbf{p}$ that minimizes
the error between the measured displacement field $\tilde{\mathbf{d}}(\mathbf{x}_i)$ and the fitted affine displacement field $\mathbf{d}(\mathbf{x}_i)$. We assume
Gaussian statistics of the process and use the $L_2$-norm as a distance measure. So, we want to find

$$\min_{\mathbf{p}} e = \sum_i w_i^2 \left\| \tilde{\mathbf{d}}(\mathbf{x}_i) - \mathbf{d}(\mathbf{x}_i) \right\|^2 \qquad (15)$$

where we allow for a weight $w_i$ for each data point $\mathbf{x}_i$. These weights may account for the confidence that we
associate with the measurement $\tilde{\mathbf{d}}(\mathbf{x}_i)$.

With

$$W = \mathrm{diag}(w_1, \dots, w_n, w_1, \dots, w_n) \in \mathbb{R}^{2n \times 2n} \qquad (16)$$

the solution of the minimization problem (15) formally is

$$\mathbf{p} = M^{*}\mathbf{d}, \qquad (17)$$

where $M^{*} = (M^T W^2 M)^{-1} M^T W^2$ is the weighted pseudo-inverse. This can be written as

$$\underbrace{M^T W^2 M}_{B}\,\mathbf{p} = \underbrace{M^T W^2 \mathbf{d}}_{\mathbf{b}} \quad\Rightarrow\quad B\mathbf{p} = \mathbf{b}. \qquad (18)$$

Now, $\mathbf{b} \in \mathbb{R}^{6}$ and $B \in \mathbb{R}^{6 \times 6}$ is a square matrix. The system can be solved for the parameter vector $\mathbf{p} = B^{-1}\mathbf{b}$
provided that $\det(B) \neq 0$ and therefore the inverse $B^{-1}$ exists. For reasons of numerical accuracy and stability one
would generally prefer to solve the overdetermined system by means of the computationally more costly singular
value decomposition (SVD) (cf. [49]). However, our relatively simple model is well-behaved and it turns out that in
the implemented case the matrix inversion is trivial.
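The weighted fit can be sketched numerically as follows. The sketch assumes the parameter order $(a_x, b_x, c_x, a_y, b_y, c_y)$ and the block layout of $M$ implied by the row references in this appendix, and it uses NumPy's SVD-based `lstsq` on the row-scaled system instead of explicitly inverting $B$; the function name is hypothetical.

```python
import numpy as np

def fit_affine(points, disp, weights=None):
    """Weighted least squares fit of d(x) = A x + t.
    points, disp: arrays of shape (n, 2) with locations and measured
    displacements; returns p = (a_x, b_x, c_x, a_y, b_y, c_y)."""
    n = len(points)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    block = np.column_stack([np.ones(n), points[:, 0], points[:, 1]])
    zeros = np.zeros_like(block)
    # First n rows: x components; last n rows: y components.
    M = np.vstack([np.hstack([block, zeros]), np.hstack([zeros, block])])
    d = np.concatenate([disp[:, 0], disp[:, 1]])
    W = np.concatenate([w, w])
    # Minimizing sum_i w_i^2 |r_i|^2 is ordinary least squares on W*M, W*d.
    p, *_ = np.linalg.lstsq(M * W[:, None], d * W, rcond=None)
    return p
```

For noise-free affine displacements the fit returns the generating parameters regardless of the (positive) weights.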
A.1 General case

In the general case of an affine displacement field with six free parameters we have $\mathbf{p}_6 = B_6^{-1}\mathbf{b}_6$, with $B_6$ and $\mathbf{b}_6$ as given in (19) and (20).
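For orientation, the six-parameter system can be written out by evaluating $B = M^T W^2 M$ and $\mathbf{b} = M^T W^2 \mathbf{d}$ with the definitions in (14) to (18). The block below is such a reconstruction, not a verbatim copy of (19) and (20); it assumes the parameter order $(a_x, b_x, c_x, a_y, b_y, c_y)$ and writes $\tilde d_x, \tilde d_y$ for the measured displacement components.

```latex
% Sums run over i = 1..n; the arguments (x_i, y_i) of \tilde d are omitted.
\[
B_6 = \begin{pmatrix} C & 0 \\ 0 & C \end{pmatrix},
\qquad
C = \begin{pmatrix}
\sum w_i^2       & \sum w_i^2 x_i     & \sum w_i^2 y_i \\
\sum w_i^2 x_i   & \sum w_i^2 x_i^2   & \sum w_i^2 x_i y_i \\
\sum w_i^2 y_i   & \sum w_i^2 x_i y_i & \sum w_i^2 y_i^2
\end{pmatrix},
\]
\[
\mathbf{b}_6 = \Bigl(
\sum w_i^2 \tilde d_x,\;
\sum w_i^2 x_i \tilde d_x,\;
\sum w_i^2 y_i \tilde d_x,\;
\sum w_i^2 \tilde d_y,\;
\sum w_i^2 x_i \tilde d_y,\;
\sum w_i^2 y_i \tilde d_y
\Bigr)^T .
\]
```

In this form rows 1 and 4 belong to $a_x$ and $a_y$, rows 2 and 6 to $b_x$ and $c_y$, and rows 3 and 5 to $c_x$ and $b_y$, which matches the row operations referred to in the special cases.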


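A.2 No shear

If we admit only translation, scale and rotation and do not allow for any component of shear, $A$ in (12) takes the
form

$$A = SR - I \qquad (21)$$

with

$$S = \begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix}, \quad
R = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}
\quad\text{and}\quad
I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \qquad (22)$$

Consequently (11) becomes

$$\mathbf{d}(\mathbf{x}_i) = (SR - I)\mathbf{x}_i + \mathbf{t}. \qquad (23)$$

Therefore, we have the constraint for the parameters:

$$b_x = c_y = s\cos\alpha - 1 \quad\text{and}\quad -b_y = c_x = s\sin\alpha. \qquad (24)$$

The scale $s$ and the angle of rotation $\alpha$ in the image plane are derived from $A$ as

$$s = \sqrt{(b_x + 1)^2 + c_x^2} \quad\text{and}\quad \alpha = \arctan\frac{c_x}{b_x + 1}. \qquad (25)$$

Of course, $s = 1$ and $\alpha = 0$ yields a constant displacement field.

With the constraints in (24), the linear system (18) can be simplified by adding the 2nd and 6th row and subtracting
the 5th from the 3rd row of (19) and (20). This gives $\mathbf{p}_4 = B_4^{-1}\mathbf{b}_4$, with $B_4$ and $\mathbf{b}_4$ as given in (26) and (27).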
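A.3 Pure translation

If we admit only pure translation ($c_y = b_x = b_y = c_x = 0$) the displacement field is of course constant, $\mathbf{d}(\mathbf{x}_i) = \mathbf{t}$. In
this case we have $B_2\mathbf{p}_2 = \mathbf{b}_2$, where

$$B_2 = \begin{pmatrix} \sum_i w_i^2 & 0 \\ 0 & \sum_i w_i^2 \end{pmatrix}
\quad\text{and}\quad
\mathbf{b}_2 = \begin{pmatrix} \sum_i w_i^2\, \tilde d_x(x_i, y_i) \\ \sum_i w_i^2\, \tilde d_y(x_i, y_i) \end{pmatrix} \qquad (28)$$

are derived from the 1st and 4th row of the general case equation system (19) and (20). Since $B_2 = \sum_i w_i^2 \cdot I$ we can
solve $B_2\mathbf{p}_2 = \mathbf{b}_2$ directly:

$$\mathbf{p}_2 = \frac{1}{\sum_i w_i^2}\,\mathbf{b}_2, \qquad (29)$$

which is equivalent to

$$a_x = \frac{\sum_i w_i^2\, \tilde d_x(x_i, y_i)}{\sum_i w_i^2}
\quad\text{and}\quad
a_y = \frac{\sum_i w_i^2\, \tilde d_y(x_i, y_i)}{\sum_i w_i^2}. \qquad (30)$$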
A.4 New reference frame

As we will see, several expressions can be simplified significantly if we express the affine displacement field in a new
frame of reference. In order to do so, we introduce a new function

$$\mathbf{d}'(\mathbf{x}'_i) = A'\mathbf{x}'_i + \mathbf{t}' = A\mathbf{x}_i + \mathbf{t} = \mathbf{d}(\mathbf{x}_i) \qquad (31)$$

expressing the affine displacement field in terms of $A'$ and $\mathbf{t}'$. We perform a coordinate transform, so that the
origin of the new reference frame coincides with the center of gravity of the image:

$$\mathbf{x}'_i = \mathbf{x}_i - \bar{\mathbf{x}} \quad\text{with}\quad \bar{\mathbf{x}} = \frac{\sum_i w_i^2\, \mathbf{x}_i}{\sum_i w_i^2}. \qquad (32)$$

In the new reference frame the equation systems take a much simpler form, since

$$\sum_i w_i^2\, \mathbf{x}'_i = \sum_i w_i^2\, \mathbf{x}_i - \bar{\mathbf{x}} \sum_i w_i^2 = 0. \qquad (33)$$

Using (31) and (15) we now want to minimize the expression

$$e = \sum_i w_i^2 \left\| \tilde{\mathbf{d}}(\mathbf{x}_i) - \mathbf{d}'(\mathbf{x}'_i) \right\|^2 \qquad (34)$$

with respect to the new parameters $a'_x, b'_x, c'_x$ and $a'_y, b'_y, c'_y$.

A.4.1 General case

From (31) and (32) we can directly derive the old parameters from the new ones:

$$A'\mathbf{x}'_i + \mathbf{t}' = A'(\mathbf{x}_i - \bar{\mathbf{x}}) + \mathbf{t}' = A'\mathbf{x}_i + (\mathbf{t}' - A'\bar{\mathbf{x}}) = A\mathbf{x}_i + \mathbf{t}. \qquad (35)$$

Comparison of the coefficients on both sides yields (and similarly for $a_y, b_y, c_y$):

$$a_x = a'_x - b'_x \bar{x} - c'_x \bar{y}, \quad b_x = b'_x \quad\text{and}\quad c_x = c'_x. \qquad (36)$$

Because of (33), in the new reference frame (19) and (20) simplify to

and by inverting the remaining 2x2 matrix we get the solutions for $b'_x, c'_x$ in closed form:

The solutions for $a'_y, b'_y, c'_y$ are found similarly.
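The simplification can be made explicit by inserting $\sum_i w_i^2 x'_i = \sum_i w_i^2 y'_i = 0$ into the normal equations $B\mathbf{p} = \mathbf{b}$. The following is a derivation sketch under the definitions of (15), (18) and (32), with the shorthand sums $S_{xx}, S_{xy}, S_{yy}$ introduced here for brevity and $\tilde d_x$ denoting the measured x displacement; it is not a verbatim copy of the equations at this point.

```latex
% Shorthand (assumed notation): S_{xx} = \sum_i w_i^2 x_i'^2, S_{xy} = \sum_i w_i^2 x_i' y_i',
% S_{yy} = \sum_i w_i^2 y_i'^2. The x block of the normal equations decouples:
\[
\begin{pmatrix}
\sum_i w_i^2 & 0 & 0 \\
0 & S_{xx} & S_{xy} \\
0 & S_{xy} & S_{yy}
\end{pmatrix}
\begin{pmatrix} a'_x \\ b'_x \\ c'_x \end{pmatrix}
=
\begin{pmatrix}
\sum_i w_i^2\, \tilde d_x \\
\sum_i w_i^2\, x'_i\, \tilde d_x \\
\sum_i w_i^2\, y'_i\, \tilde d_x
\end{pmatrix},
\]
\[
a'_x = \frac{\sum_i w_i^2\, \tilde d_x}{\sum_i w_i^2},
\qquad
\begin{pmatrix} b'_x \\ c'_x \end{pmatrix}
= \frac{1}{S_{xx} S_{yy} - S_{xy}^2}
\begin{pmatrix} S_{yy} & -S_{xy} \\ -S_{xy} & S_{xx} \end{pmatrix}
\begin{pmatrix}
\sum_i w_i^2\, x'_i\, \tilde d_x \\
\sum_i w_i^2\, y'_i\, \tilde d_x
\end{pmatrix}.
\]
```

The y block has the same matrix with $\tilde d_y$ on the right-hand side, so the translation term reduces to a weighted mean and only one 2x2 inversion is needed.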
A.4.2 No shear

In the new reference frame (26) and (27) simplify to
A.5 Homogeneous error of the fit

In order to assess how well the estimated affine motion $\mathbf{d}'(\mathbf{x}'_i)$ describes the measured displacement field $\tilde{\mathbf{d}}(\mathbf{x}_i)$, let us
now derive the homogeneous error of the fit. Setting all weights $w_i = 1$ we obtain from (34):

$$\mathit{he} = \sum_i \left\| \tilde{\mathbf{d}}(\mathbf{x}_i) - \mathbf{d}'(\mathbf{x}'_i) \right\|^2 = \sum_i \left( \tilde{\mathbf{d}}(\mathbf{x}_i) - (A'\mathbf{x}'_i + \mathbf{t}') \right)^2.$$

Note that, because of the $L_2$-norm, $\mathit{he}$ can be decomposed into the errors of the x and y components:

$$\mathit{he} = \mathit{he}_x + \mathit{he}_y.$$

The mean deviation of the measured displacements from the fitted affine displacement field is estimated by the
homogeneous variance

$$\mathrm{var}_h \approx \frac{\mathit{he}}{n}.$$

In the general case we obtain for the error of the x component:

and similarly for $\mathit{he}_y$, where $n$ is the number of data points.

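A.6 Weighted error of the fit

The mean deviation of the measured displacements $\tilde{\mathbf{d}}(\mathbf{x}_i)$ from the fitted affine displacement field $\mathbf{d}'(\mathbf{x}'_i)$ is estimated
by the weighted variance, approximated by

$$\mathrm{var}_w \approx \frac{\mathit{we}}{\sum_i w_i^2}. \qquad (51)$$

From (34) we get the weighted error of the fit:

$$\mathit{we} = \sum_i w_i^2 \left\| \tilde{\mathbf{d}}(\mathbf{x}_i) - \mathbf{d}'(\mathbf{x}'_i) \right\|^2 = \sum_i w_i^2 \left( \tilde{\mathbf{d}}(\mathbf{x}_i) - (A'\mathbf{x}'_i + \mathbf{t}') \right)^2. \qquad (52)$$

Note that $\mathit{we}$ can be decomposed again into the errors of the x and y components:

$$\mathit{we} = \mathit{we}_x + \mathit{we}_y.$$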
A.6.1 General case

In the general case we get for the error of the x component:

$$\mathit{we}_x = \sum_i w_i^2\, \tilde d_x^{\,2}(x_i, y_i) \; \cdots$$

and similarly for $\mathit{we}_y$. Since $\sum_i w_i^2 x'_i = \sum_i w_i^2 y'_i = 0$ and with the solution for $a'_x$ several terms cancel out and we get
A.6.2 No shear

With $c_y = b_x$ and $b_y = -c_x$ we get from (53) and (55)

$$\mathit{we} = \sum_i w_i^2\, \tilde d_x^{\,2}(x_i, y_i) + \cdots \qquad (56)$$

This can be further simplified using the solutions for $b'_x, c'_x$:

$$\mathit{we} = \sum_i w_i^2\, \tilde d_x^{\,2}(x_i, y_i) + \sum_i w_i^2\, \tilde d_y^{\,2}(x_i, y_i) \; \cdots \qquad (57)$$

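A.6.3 Pure translation

In the case of pure translation ($c_y = b_x = b_y = c_x = 0$) we see from (36) that $a_x = a'_x$ and $a_y = a'_y$. The error of the
fit is then given by

$$\mathit{we} = \sum_i w_i^2\, \tilde d_x^{\,2}(x_i, y_i) + \sum_i w_i^2\, \tilde d_y^{\,2}(x_i, y_i) - (a_x^2 + a_y^2) \sum_i w_i^2. \qquad (58)$$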

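A.7 Propagation of affine parameters from coarse to fine levels

In the following we derive how to propagate the affine motion parameters from a coarse pyramid level to the next finer
level. The affine displacement field

$$\mathbf{d}^{l}(\mathbf{x}^{l}_i) = A^{l}\mathbf{x}^{l}_i + \mathbf{t}^{l} \qquad (59)$$

at the level $l$ of the pyramid is determined at any image location $\mathbf{x}^{l}_i = (x^{l}_i, y^{l}_i)^T$ by six parameters:

$$A^{l} = \begin{pmatrix} b^{l}_x & c^{l}_x \\ b^{l}_y & c^{l}_y \end{pmatrix}
\quad\text{and}\quad
\mathbf{t}^{l} = \begin{pmatrix} a^{l}_x \\ a^{l}_y \end{pmatrix}. \qquad (60)$$

For the previous coarser level $l + 1$ with lower resolution we have accordingly

$$\mathbf{d}^{l+1}(\mathbf{x}^{l+1}_i) = A^{l+1}\mathbf{x}^{l+1}_i + \mathbf{t}^{l+1}. \qquad (61)$$

Now, since the sampling grid of level $l$ has twice the density of the coarser level $l + 1$, the coordinates of
corresponding points and their displacements are related by

$$\mathbf{x}^{l}_i = 2\,\mathbf{x}^{l+1}_i \quad\text{and}\quad \mathbf{d}^{l}(\mathbf{x}^{l}_i) = 2\,\mathbf{d}^{l+1}(\mathbf{x}^{l+1}_i). \qquad (62)$$

Inserting this into (59) leads to

$$\mathbf{d}^{l+1}(\mathbf{x}^{l+1}_i) = A^{l}\mathbf{x}^{l+1}_i + \tfrac{1}{2}\,\mathbf{t}^{l}. \qquad (63)$$

Comparing this with (61) yields

$$A^{l} = A^{l+1} \quad\text{and}\quad \mathbf{t}^{l} = 2\,\mathbf{t}^{l+1}. \qquad (64)$$

Therefore, to propagate the affine parameters from level $l + 1$ to the next finer level $l$ we have only to double the
translation vector $\mathbf{t}^{l+1}$. The coefficients of matrix $A^{l+1}$ remain unchanged.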
A.8 Combining the affine parameters at one level

The combination of the initial affine parameters on level $l$ (propagated from level $l + 1$) with the parameters of
the residual affine motion on level $l$ (estimated from the residual OF at level $l$) is now considered. The refined
displacement field $\mathbf{d}^{l}_r$ at level $l$ is given by

$$\mathbf{d}^{l}_r = \mathbf{d}^{l} + \mathbf{d}^{l}_\Delta, \qquad (65)$$

where $\mathbf{d}^{l}_\Delta(\mathbf{x}^{l}_i) = A^{l}_\Delta\mathbf{x}^{l}_i + \mathbf{t}^{l}_\Delta$ is the residual affine displacement field estimated at level $l$. With (59) we get because of
the linearity of the affine transformation:

$$\mathbf{d}^{l}_r(\mathbf{x}^{l}_i) = (A^{l}\mathbf{x}^{l}_i + \mathbf{t}^{l}) + (A^{l}_\Delta\mathbf{x}^{l}_i + \mathbf{t}^{l}_\Delta) = (A^{l} + A^{l}_\Delta)\mathbf{x}^{l}_i + (\mathbf{t}^{l} + \mathbf{t}^{l}_\Delta) = A^{l}_r\mathbf{x}^{l}_i + \mathbf{t}^{l}_r. \qquad (66)$$

Therefore, the refined affine parameters $(A^{l}_r, \mathbf{t}^{l}_r)$ at level $l$ are given as the sum of the propagated parameters $(A^{l}, \mathbf{t}^{l})$
and the residual parameters $(A^{l}_\Delta, \mathbf{t}^{l}_\Delta)$ estimated at level $l$.
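The two rules derived in A.7 and A.8 amount to very little code. The following sketch (function names are illustrative) propagates parameters from a coarse level to the next finer one and then adds the residual parameters estimated there.

```python
import numpy as np

def propagate(A_coarse, t_coarse):
    """Level l+1 -> level l: matrix unchanged, translation doubled."""
    return A_coarse.copy(), 2.0 * t_coarse

def combine(A_init, t_init, A_res, t_res):
    """Refined parameters are the sums of propagated and residual ones."""
    return A_init + A_res, t_init + t_res
```

In a coarse-to-fine loop one would call `propagate` once per level, estimate the residual affine motion at that level, and then call `combine` before descending further.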


B  Inversion  of  affine  pose  transformation 

The transformation defined in (11) and (12) gives us the displacements from the example image E to the new image
I. This allows us to warp the face in I towards the normalized pose of the example image E. The position of a point
$\mathbf{x}'_i = (x'_i, y'_i)^T$ in I is derived from the affine displacement field $\mathbf{d}(\mathbf{x}_i)$ that maps $\mathbf{x}_i \to \mathbf{x}'_i$ by

$$\mathbf{x}'_i = \mathbf{d}(\mathbf{x}_i) + \mathbf{x}_i = (A + I)\mathbf{x}_i + \mathbf{t}. \qquad (67)$$

Here, the prime indicates that we use image I as a reference frame.

On the receiver side we are faced with the problem of reversing this pose normalization. We have a face image E
with normalized pose (indexed in the database, or interpolated between several examples) and we want to generate
an image I according to the transmitted pose parameters of our model. We are now looking for the displacement
field $\mathbf{d}'(\mathbf{x}'_i)$ that maps $\mathbf{x}'_i \to \mathbf{x}_i$. For our affine transformation model there is a closed-form solution. Provided that
$\det(A + I) \neq 0$ the inverse matrix exists, and we obtain from (67) the location $\mathbf{x}_i$ in the normalized image E by

$$\mathbf{x}_i = (A + I)^{-1}\mathbf{x}'_i - (A + I)^{-1}\mathbf{t}. \qquad (68)$$

With the definition

$$\mathbf{d}'(\mathbf{x}'_i) = A'\mathbf{x}'_i + \mathbf{t}' = \mathbf{x}_i - \mathbf{x}'_i \qquad (69)$$

we conclude from (68) that

$$A' = (A + I)^{-1} - I \quad\text{and}\quad \mathbf{t}' = -(A + I)^{-1}\mathbf{t}. \qquad (70)$$

Therefore, expressed in terms of the model parameters, $A'$ is given by

$$A' = \frac{1}{\det(A + I)} \begin{pmatrix} c_y + 1 & -c_x \\ -b_y & b_x + 1 \end{pmatrix} - \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \qquad (71)$$
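The closed-form inversion can be checked numerically with a short sketch. It assumes the parametrization $\mathbf{d}(\mathbf{x}) = A\mathbf{x} + \mathbf{t}$ used throughout Appendix A; the function name and the singularity tolerance are illustrative choices.

```python
import numpy as np

def invert_affine_displacement(A, t):
    """Given d(x) = A x + t mapping E -> I, return (A', t') of the
    displacement field d'(x') = A' x' + t' mapping I -> E."""
    B = A + np.eye(2)
    if abs(np.linalg.det(B)) < 1e-12:
        raise ValueError("A + I is singular; the pose transform is not invertible")
    B_inv = np.linalg.inv(B)
    return B_inv - np.eye(2), -B_inv @ t
```

Applying the forward displacement to a point and then the inverse displacement to the result returns the original point, which is the property the receiver relies on when restoring the original pose.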


Figure 3: Depicted are several facial expressions that may occur during a video-conferencing session. All images
exhibit  a  roughly  frontal  view  of  the  face.  Conventional  approaches  to  normalize  face  images  utilize  labeled  feature 
points,  like  the  pupils  of  the  eyes  or  the  tips  of  the  mouth.  The  first  image  shows  a  neutral  facial  expression  with  gaze 
in  the  forward  direction.  This  is  the  standard  case  that  most  face  recognition  systems  are  designed  to  cope  with.  The 
second  row  demonstrates  the  movement  of  both  eyes  due  to  changes  in  direction  of  gaze  (conjugate  eye-movements); 
vergence  movements  (disjunctive  eye-movements)  alter  the  distance  between  the  pupils.  The  positions  of  these  points 
(centers  of  the  pupils)  may  differ  by  more  than  1  cm  on  either  side  of  the  forward  direction.  This  is  a  large  fraction 
of  the  inter-ocular  distance  of  about  7  cm.  The  last  row  depicts  the  movement  of  the  corners  of  the  mouth  due  to 
skin  deformations  caused  by  facial  expressions  and  normal  speech.  As  is  evident,  estimating  the  pose  based  on  the 
correspondence  of  these  points  is  rather  unreliable  if  facial  expressions  are  admissible.  Nevertheless,  such  feature 
points  are  commonly  used  for  normalizing  face  images  with  moderate  deviations  from  neutral  expressions.  Finally, 
the pupils may entirely disappear when the eyelid is closed during winking or blinking. A pose estimation method
relying  on  the  correct  detection  of  these  feature  points  would  be  led  astray. 
Figure 4: Band-pass filtered images of a face (in front of a dark background). A Difference-of-Gaussians (DOG)
filter approximating a Laplacian-of-Gaussian operator of width σ ("Mexican-hat filter") is applied to the original
image. The images are arranged in descending order of σ, starting with σ = 64.0 pixels for the upper left image and
ending with σ = 2.0 pixels for the lower right image; σ decreases by one octave between consecutive images. Facial
expressions and details of the face (eyes, mouth, etc.) are most conspicuous in the high-frequency images, whereas
the overall position, size, and orientation of the head appear dominant in the low-frequency bands.
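The band-pass decomposition described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the filter used for the figure: the kernel is one-dimensional, the ratio of 1.6 between the two Gaussian widths is a common textbook choice that the caption does not specify, and the names `dog_kernel` and `octave_sigmas` are hypothetical.

```python
import math


def gaussian(x, sigma):
    """Value of a normalized 1-D Gaussian of width sigma at position x."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


def dog_kernel(sigma, ratio=1.6, radius_factor=4.0):
    """Sampled 1-D Difference-of-Gaussians kernel (narrow minus wide Gaussian).

    `ratio` (wide width / narrow width) is an assumed value.
    """
    sigma_wide = ratio * sigma
    radius = int(math.ceil(radius_factor * sigma_wide))
    kernel = [gaussian(x, sigma) - gaussian(x, sigma_wide)
              for x in range(-radius, radius + 1)]
    # Remove the residual mean so that a constant image gives zero response.
    mean = sum(kernel) / len(kernel)
    return [k - mean for k in kernel]


def octave_sigmas(sigma_max=64.0, sigma_min=2.0):
    """Filter widths from sigma_max down to sigma_min in steps of one octave."""
    sigmas = []
    sigma = sigma_max
    while sigma >= sigma_min:
        sigmas.append(sigma)
        sigma /= 2.0
    return sigmas
```

With the default arguments, `octave_sigmas()` yields the six widths 64, 32, 16, 8, 4, and 2 pixels, matching the six images of the figure. A two-dimensional filter could be obtained by applying separable Gaussian smoothing along rows and columns and subtracting the two smoothed images.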
Figure 5: Two face images with similar expressions, but different pose and size, used for demonstrating the robust
pose  estimation  and  compensation  algorithm.  The  left  image  is  the  reference  image,  e.g.,  stored  in  the  example 
database,  the  right  one  is  a  new  frame  of  the  incoming  data  stream  that  has  to  be  normalized  in  pose.  All  images 
used  in  the  sequel  are  255  x  320  pixels  in  size  and  digitized  with  8-bit  quantization.  The  focal  length  of  the  camera 
was  approximately  16  mm  and  the  camera  distance  was  about  1.2  m.  The  face  on  the  right  side  is  inclined  by  8-9° 
and  is  about  10%  smaller  than  that  in  the  reference  image.  These  values  are  obtained  by  direct  reading  from  the 
images. 


Figure  6:  Here  the  right  image  of  Figure  5  is  transformed  to  resemble  the  pose  of  the  reference  image.  Pose 
parameters  have  been  computed  automatically  by  the  algorithm  described  in  Section  4.1.  The  six  images  show  how 
the results depend on the resolution level where the estimation is terminated. Subsequently, the pose parameters are
extrapolated  to  the  original  resolution  and  the  images  are  warped  accordingly.  The  lowest  resolution  is  level  5  (upper 
left);  the  original  image  resolution  is  at  level  0  (lower  right).  By  visual  inspection  it  is  obvious  that  level  3  (upper 
right)  already  achieves  good  alignment  of  the  face  with  the  reference  image.  The  resulting  image  becomes  stationary 
and  estimation  at  higher  resolutions  does  not  lead  to  significant  improvement.  For  more  quantitative  details  see 
Figure  7  and  Table  2. 
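The extrapolation of pose parameters from a coarse pyramid level to the original resolution can be illustrated with a short sketch. It assumes a dyadic pyramid (each level halves the resolution) whose levels share a common coordinate origin, and an affine displacement d(x) = A x + t. Neither assumption is stated in the caption, and the function name `propagate_to_level0` is hypothetical. Under these assumptions a location x_0 at level 0 corresponds to x_0 / 2^l at level l, so the linear part A is unchanged while the translation t is multiplied by 2^l.

```python
def propagate_to_level0(A, t, level):
    """Express affine displacement parameters estimated at `level` at level 0.

    Assumes a factor-2 pyramid with a shared origin: the matrix A stays the
    same, and the translation grows by 2**level.
    """
    scale = 2 ** level
    a0 = [row[:] for row in A]  # copy, so the caller's matrix is not aliased
    t0 = [scale * t[0], scale * t[1]]
    return a0, t0


def displacement(A, t, x):
    """Evaluate d(x) = A x + t."""
    return [A[0][0] * x[0] + A[0][1] * x[1] + t[0],
            A[1][0] * x[0] + A[1][1] * x[1] + t[1]]
```

For example, a translation of 1.5 pixels estimated at level 3 corresponds to 12 pixels at the original resolution, whereas scale and rotation, which enter only through A, carry over unchanged.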
Figure 7: Plot symbols: a_x: ○; a_y: +; s: □; α: ×.

Different graphic representations of the estimated pose parameters given in Table 2. Investigated is the dependence
of the accuracy on the pyramid level where the estimation is terminated. The diagrams show: a) pose parameters
at each final level normalized by their bottom values (level 0); b) error to highest resolution estimate normalized
by the bottom value; c) relative change between successive levels normalized by the value of the lower current level;
d) relative change between successive levels normalized by the bottom value. The notation is the same as is used
throughout the text: a_x and a_y are the horizontal and vertical translation, respectively; s is the scale factor; α is
the angle of in-plane rotation. The more careful analysis confirms the results already discussed in Figure 6. The
estimates converge to the values at the highest resolution. At level 2 the estimated parameters are already very close
to the values obtained at the highest resolution (level 0). The largest changes in the parameters happen already at
the high pyramid levels with low resolutions (see lower row).


Table 2: Pose parameters estimated by the algorithm. The values in each row correspond to the images depicted
in Figure 6, where l is the final level for estimation before the parameters are propagated to the original resolution
at level 0. The notation for the parameters is the same as throughout the text: a_x and a_y are the horizontal and
vertical translation, respectively; s is the scale factor; α is the angle of in-plane rotation in radians. The last two
columns give the weighted variance (var_w) and the homogeneous variance (var_h) as discussed in Section 4.3.
Table 3: Pose parameters at the bottom level (l = 0) depending on the number of iterations i per level estimated for
the image pair in Figure 5. The estimates vary very little for increasing number of iterations. The largest changes
(for a_y and α between i = 1 and 2) are on the order of 5%. var_w and var_h are the weighted and the homogeneous
variance, respectively.


Figure  8:  Two  other  face  images.  This  example  is  somewhat  more  “difficult”  as  compared  to  Figure  5,  since  the  facial 
expressions  in  both  images  are  quite  different.  Again,  the  left  image  represents  the  reference  pose.  An  inclination  of 
6-7° and an increase in size of about 7-8% can be measured directly by comparing both images.


Figure 9: See caption of Figure 6. The results in this figure are for the two images in Figure 8. By visual inspection
no significant change occurs once the final estimation level reaches level two (lower left) or a higher resolution.
Figure 10: Plot symbols: a_x: ○; a_y: +; s: □; α: ×.

Different graphic representations of the estimated pose parameters given in Table 4. See also caption of Figure 7.
Again, the parameters converge to the values at highest resolution (upper row) and the largest changes take place
at the high pyramid levels representing low frequencies (lower row). However, this tendency is not as pronounced as
in Figure 7.


Table  4:  Pose  parameters  for  the  resulting  images  in  Figure  9.  See  also  caption  of  Table  2. 
Figure 11: Plot symbols: a_x: ○; a_y: +; s: □; α: ×.

More  detailed  analysis  of  the  parameters  depending  on  the  number  of  iterations  (1,  2,  3,  5,  respectively)  per  level 
before  the  estimation  is  continued  at  the  next  higher  resolution  level.  The  values  are  normalized  by  the  bottom 
value.  Table  5  gives  the  corresponding  absolute  parameter  values.  Diagram  a)  is  identical  to  Figure  10  a).  The 
major  improvement  is  achieved  by  performing  two  iterations  instead  of  one  pass  since  the  convergence  to  the  bottom 
values  becomes  smoother. 


Table 5: Estimated pose parameters at the bottom level (l = 0) as functions of the number of iterations i per level.
The values correspond to the left data points in Figure 11. Note the significant change in a_y and α between i = 1
and 2, whereas the variation for larger numbers of iterations is fairly small.
Figure 12: Illustration of the results depending on the number of iterations per level: left i = 1; middle i = 3; right
i = 10. Careful visual inspection is required to perceive the differences although the values in Table 5 differ.


Figure  13:  The  image  transformation  can  be  inverted  due  to  the  parametric  model  used  here.  Given  the  pose 
parameters used to align the new image with the reference image, the inverse parameters can be computed as
described in the text. Note that this is not possible for a general mapping. For illustration, the reference image in
Figure 8 is transformed to the pose of the new image, i.e., the mapping is done in the reverse direction.


Figure 14: Sequence of images to exemplify the processing steps of a simple video-conference system. The left
image is a newly acquired video frame. The middle image is the most similar normalized example image found
in the database. The pose parameters of the new image with respect to the reference example are estimated and
transmitted  together  with  the  index  number  for  the  example.  On  the  receiver  side  the  stored  normalized  example  is 
transformed  towards  the  pose  of  the  new  image  on  the  sender  side  (right  image). 
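To make the data flow of this scheme concrete, the per-frame message can be sketched as an example index plus four pose parameters. The byte layout below is purely hypothetical (the text does not define a wire format): a 16-bit index followed by a_x, a_y, s, and α as 32-bit floats. The names `FRAME_FORMAT`, `encode_frame`, `decode_frame`, and `bits_per_second` are likewise assumptions made for this sketch.

```python
import struct

# Hypothetical layout: little-endian, unsigned 16-bit example index,
# then a_x, a_y, s, alpha as 32-bit floats (18 bytes in total).
FRAME_FORMAT = "<Hffff"


def encode_frame(example_index, a_x, a_y, s, alpha):
    """Sender side: pack the database index and the pose parameters."""
    return struct.pack(FRAME_FORMAT, example_index, a_x, a_y, s, alpha)


def decode_frame(payload):
    """Receiver side: recover (example_index, a_x, a_y, s, alpha)."""
    return struct.unpack(FRAME_FORMAT, payload)


def bits_per_second(frame_rate):
    """Raw payload rate for a given number of frames per second."""
    return 8 * struct.calcsize(FRAME_FORMAT) * frame_rate
```

Under this assumed layout each frame costs 18 bytes, so the payload rate grows linearly with the frame rate and stays far below that of transmitting the 8-bit images themselves. On the receiver side, the decoded index selects the stored normalized example, which is then warped with the inverse pose transformation.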
Figure 15: Several images to evaluate and demonstrate the robustness of the algorithm and the range of poses it
can deal with. The horizontal and vertical translation in the first two rows amount to about 110 and 50 pixels,
respectively. In the third row an in-plane rotation of 15-20° to each side (measured from the vertical line) can be
read from the images. Finally, the variation in size in the last row is in the range of 25%. Other examples with
different backgrounds and a variety of distinct poses and facial expressions can be found in [58].
Figure 16: Pose compensated versions of the original images depicted in Figure 15 are in corresponding positions.
The image in the first row is the reference image defining the intended pose. The rather distinct facial expression as
compared to the new images is noteworthy. These results are obtained with only one iteration (i = 1) per level and
the  estimation  is  terminated  at  level  two  (/  =  2).  The  reference  image  resembles  the  images  in  the  last  row  of  Figure 
3.  Conventional  algorithms  relying  on  localized  feature  points  would  very  likely  produce  unreliable  results. 


Figure  17:  Results  similar  to  Figure  16,  but  for  a  reference  image  with  one  eye  closed  and  with  severe  distortions 
around  the  mouth. 